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Abstract 

This is a research prototype developed with the objective to contribute towards further detailed 
actuarial investigations of lapse in Malaysia. In 2012, there were 860,316 new traditional individual life 
policies issued but at the same time surrenders and forfeitures net of revivals amounted to 420,644 
policies, which is almost 50% of new policies issued. Improvements in lapse models can lead to 
operational efficiencies in distribution and policy conservation, as well as increased accuracy of the 
parameterization of actuarial models, creating value for consumers and life insurers. 

In this prototype, factors affecting lapse of life insurance policies in Malaysia are investigated using 
statistical techniques from the generalized linear model family. The investigations perused publicly 
available summarized data, and hence can be freely published without concerns relating to data 
protection, but unfortunately such data lacks granularity required for very precise assessments. Issues 
central to generalized linear models such as multicollinearity of explanatory variables, interpretation of 
model results, over-dispersion, model selection procedures, model diagnostics, model lift, and most 
critically actuarial judgment on the reasonableness of the model, are discussed. 



The workings underlying this paper, including the R commands, are specified in detail. 




Section 1 
Background 

1.1 The objective of this paper is to contribute further 
towards actuarial research in the field of lapse 
modelling. Generalised linear models and similar 
predictive modelling techniques are currently 
widespread in the field of property and casualty 
insurance, but relatively at its infancy in the field of life 
insurance. 

1.2 As data quality, data granularity, computing power 
and product proliferation increases, predictive 
modelling techniques will emerge as a common 
currency for actuarial experience analysis in the very 
near future. An effective lapse model has the potential 
to contribute towards operational efficiencies in 
distribution and policy conservation, as well as 
increased accuracy of the parameterization of actuarial 
models. 

1.3 This paper extends existing research to an empirical 
study of the lapses in the Malaysian life insurance 
industry from 2008 to 2012. Poisson regression, 
negative binomial regression and binomial logistic 



regressions were applied to examine summarized 
industry-wide lapse data. 

1.4 Issues central to generalized linear models such as 
multicollinearity of explanatory variables, 
interpretation of model results, over-dispersion, model 
selection procedures, model diagnostics, model lift, 
and most critically actuarial judgment on the 
reasonableness of the model, are discussed. 

1.5 Whilst a main shortcoming of this model is the lack of 
granular data, a significant upside of using publicly 
available data is that the paper can be published 
without competitive and data protection concerns. 

1.6 The main aim of this paper is to contribute towards 
further research. It is written to with the goal of 
enabling the readers to understand and follow the 
workings, and ultimately expand research on this 
subject. 

1.7 Section 2 and Appendix A sets out the source of the 
data and the rationale behind calibrations performed 
on the data. The explanatory variables selected are 
discussed in Section 3. Basic features of the model are 
set out in Section 4. 
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1.8 



The exploratory analyses are carried out in Section 5, 
and multicollinearity is investigated in Section 6. 



1.9 Sections 7, 8, 9, 10 and 11 sets out the Poisson model, 
quasi-Poisson model, negative binomial model, 
binomial model and quasi-binomial model 
respectively. Models with company as the only 
explanatory variable are discussed in Section 12. 

1.10 The R commands are set out in Appendix B. 
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Section 2 
Data 

2.1 The data is extracted from the Annual Insurance 
Statistics of Bank Negara Malaysia from reporting 
years 2006 to 2012, grouped by individual companies 
for each reporting year on an annual basis. 

2.2 The reporting years are chosen to be the years in which 
the author is active in the Malaysian market, and hence 
the author in a better position to make any necessary 
actuarial judgments. 

2.3 Whilst the period of investigation selected is 2008 to 
2012, data was also collected for the years 2006 and 
2007 for explanatory variables which will be discussed 
in Section 3. 

2.4 This paper only investigates traditional individual life 
policies. Investment linked, group life and annuities 
are excluded. 

2.5 This is the only source of reliable lapse data in 
Malaysia which is publicly available. The main reason 
why this paper is a prototype and not a complete paper 
is because the format of the summarized data available 



significantly limits the explanatory and predictive 
power of the model. 

2.6 The summarized data are generally found to be 
reliable. 

2.7 The calibration of the raw data into data used as input 
for the analyses, are set out in Appendix A. 

2.8 75 observations are used for the analyses. 

2.9 Furthermore, as the objective of this prototype is to 
contribute to further research rather than as an end 
product to be used as a predictive decision making 
tool, the entire dataset is used for model calibration. 
Model validation using a segregated random data 
subset is not performed. 
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Section 3 

Explanatory Variables 

3.1 The selection of these explanatory variables is mainly 
driven by data availability. 

3.2 Life insurance company ("company") - this variable 
examines how differences in the business practice of 
different life insurance companies affect lapses. 

3.2.1 This variable is analysed in a model separate from the 
other explanatory variables. This is because the format 
of the available data, coupled with consistent business 
mix of life insurers over the investigation period, 
would have otherwise led to misleading model results 
due to multicollinearity. This is further discussed in 
Section 6. 

3.2.2 This variable is treated as a binary variable. The 
company with the largest exposure, company 7, is 
selected as the baseline. 

3.3 Time period ("year") - the time period is taken to be 
the reporting year i.e. the calendar year of the financial 
year end of the life insurance company. 



3.3.1 This is an exogenous variable to examine the external 
effects affecting lapses such as the general economic 
and commercial environment. Nonetheless, the annual 
frequency in which the summarized data is presented, 
coupled with the different financial year periods for 
different life insurance companies significantly reduces 
the explanatory power of this variable in this paper. 

3.3.2 This variable is treated as a binary variable, and the 
year with the largest exposure, year 2012, is selected as 
the baseline. 

3.4 Product type - whole life, term and others ("wl", "tm", 
"ot") - the proportion of inforce policies for each 
product type written by each life insurance company is 
used as a variable to examine the different lapse 
behaviour for different product types. 

3.4.1 The data available is summarized by whole life, 
endowment, term and others (which is understood to 
comprise mainly of standalone medical products). 

3.4.2 Data is not sufficiently granular to made further 
distinctions within a product class e.g. participating vs 
non-participating or anticipated endowment vs lump 
sum endowment, which are factors that could have a 
significant effect on lapse. 
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3.4.3 These variables are treated as continuous variables, 
with the proportion of endowment policies not 
explicitly considered and hence included as the 
baseline by default in a model that considers all these 
product type explanatory variables. 



Premium payment mode - single premium and regular 
premium ("sp") - the proportion of new business 
premium which is single premium, written by each life 
insurance company is used as a variable to examine the 
differences in lapse behaviour between single premium 
policies and regular premium policies. 



3.5.1 This, to a very large extent, assumes that the business 
mix between single premium and regular premium 
products is the same for new business and inforce 
business, as the data available does not capture the 
proportion of inforce business which is single 
premium. 



3.5.2 The summarized data also does not capture other 
premium payment features which could have a 
significant impact on lapse behaviour such as premium 
payment frequency and premium payment term. 



3.5.3 This variable is treated as a continuous variable, with 
regular premium included as the baseline by default in 
a model that considers this explanatory variable. 



Policy duration - first policy year, second policy year, 
third policy year ("dO", "dl", "d2") - the proportion of 
new business premiums written to the current year 
inforce premiums, is captured separately for the three 
most current years with respect to the reporting year. 



3.6.1 These variables are used in attempt to examine the 
lapse behaviour of different policy durations. Policy 
duration, in particular early policy duration is of 
particular interest in lapse analyses, as not only the 
higher incidence of lapse is typically evident due to 
churning, but furthermore it is typical that lapse at 
early policy duration results in higher financial 
implication for both the consumer and the life 
insurance company. 



3.6.2 These variables are treated as continuous variables, 
with policy duration 3 years and more included as the 
baseline by default in a model that considers all these 
policy duration explanatory variables. 




Section 4 



£00 = m = g-\xp) 



Model 

4.1 In Sections 7, 8 and 9, this paper considers the Poisson, 
the quasi-Poisson and the negative binomial regression 
models, where lapse is modeled as a count variable. In 
Sections 10 and 11, this paper considers the binomial 
and the quasi-binomial logistic regression models, 
where lapse is modeled as a binary variable. These are 
statistical models classified under the generalized 
linear model (GLM) family. 

4.2 The following merely sets out a brief background of 
the GLM theory, there are rich and publicly available 
literature setting out the detailed theory and 
applications of GLM, and hence it is not necessary to 
reproduce them in this paper. Nonetheless, readers 
who are not familiar with the subject of GLM would 
benefit from reading this paper in conjunction with a 
review of basic GLM literatures. 

4.3 GLMs assume the response variable, Y, is generated 
from a distribution in the exponential family, where 
the mean of the distribution, p, depends on the 
explanatory variables X. Formulaically this is given by 



where X|3, a linear combination of the parameters /?, is 
the linear predictor. 

4.4 The link function g provides the relationship between 
p and X|3. 

4.4.1 The link function used for the Poisson and quasi- 
Poisson models is the log link function, which is also 
the canonical link function, and is given byX/? = 
log(p). 

4.4.2 In this paper, the log link function is also used for the 
negative binomial model. The log link function is a 
commonly used link function for negative binomial 
models. 

4.4.3 The link function used for the binomial model is the 
logit link function, given by X /? = log(-X-). This is the 

canonical link function for the binomial distribution. 

4.4.4 The (3's are estimated by maximum likelihood 
techniques using an iteratively reweighted least 
squares algorithm. 

4.5 The Poisson model and negative binomial model can 
be expressed as p = e~ x P , in a multiplicative form. 
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4.6 The binomial model can be expressed as [i = 1+ ^- x p ■ 

4.7 The modelling is performed in RStudio, a free and 
open source integrated development environment for 
R. 

4.8 This paper relies on commonly used statistical 
packages on R. The R commands are set out in 
Appendix 2. 
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Section 5 



Exploratory Analysis 



5.1 A typical first step in modelling is to perform 
exploratory analyses on the observations. 



5.2 In the graphs below, the dotted lines represent the 
average lapse rates or log(lapse rates). The solid line in 
G5.1 is a density curve, the solid line in G5.2 is fitted by 
simple linear regression and the solid lines in the other 
graphs are fitted by local regression. 




5.3 A histogram and density plot of the lapse rates is 
shown in G5.1. Lapse rates are observed to have a 
positive skew, and the median is observed to be higher 
than the mean. 

5.4 G5.2 shows a positive linear relationship between 
log(lapse) and log(exposure), with several outliers 
having a lower log(lapse) for a given log(exposure). 




5.5 G5.3 and G5.4 in the next page shows how lapse rates 

and the log(lapse rates) differs between the companies. 

5.5.1 Company 7 will be modeled as the baseline company 
in the company only model in Section 12. 






1 2 3 ^ 5 6 7 8 Q 10 11 12 13 14 15 

company 



G5.4: Boxplot of log(lapse rates) and company 
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5.5.2 The coefficients of most other companies is expected to 
be statistically significant, with the exception of 
company 3 and company 6, which has lapse rates not 
too dissimilar to those of company 7. 

5.6 G5.5 shows that, with the exception of few outliers, 

lapse rates do not change significantly over the 5 years 
investigation period. 

G5.5: Boxplot of log(lapse rates) and year 




2008 2009 2010 2011 2012 

year 

5.7 There is no distinguishable pattern between log (lapse 
rates) and wl shown in G5.6 in the next page, except 
for a cluster of outliers is observed for high wl. 
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G5.6: Scatterplot of log(lapse rate) and wl 
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G5.7: Scatterplot of log(lapse rates) and tm 
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5.8 G5.7 shows a decreasing relationship between 
log(lapse rates) and tm, with the log(lapse rates) seems 
to be clustered into 3 groups of tm. 

5.9 A decreasing relationship between log(lapse rates) and 
ot, with ot very highly concentrated at low values and 
one extreme outlier value is shown in G5.8. 

5.10 There is no strong relationship between log(lapse rates) 
and sp as shown in G5.9 in the next page. There is a 
cluster of observations with low sp values. 









5.11 G5.10 shows a strong positive relationship between 
log(lapse rates) and dO, with few notable outliers at the 
right hand side and at the bottom of the graph. 

5.12 G5.ll shows a relationship between log(lapse rates) 
and dl which is similar to the relationship of log(lapse 
rates) and d 0, with the exception of one outlier in the 
top left of the graph. 

5.13 G5.12 in the next page shows a relationship between 
log(lapse rates) and d2 also similar to the relationship 
of log(lapse rates) and dl. 











G5.12: Scatterplot of log(lapse rate) and d2 
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5.14 The outliers observed in the explanatory analyses are 
left untreated in the observations used for the models. 
Model diagnostics in later sections will indicate the 
effect of these outliers on the model. 
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Section 6 



Multicollinearity 

6.1 One significant weakness of GLM is its inability to deal 
with multicollinearity of the explanatory variables i.e. 
where explanatory variables are highly correlated. 
Hence, the model assumes the explanatory variables 
are orthogonal. 

6.2 To detect potential issues with multicollinearity, a 
Pearson correlation matrix of the continuous 
explanatory variables is presented in G6.1. 

6.3 Whilst there is no fixed rule to confirm a 
multicollinearity problem or otherwise based on the 
Pearson correlation, the correlation coefficient between 
ot and dO of 0.59 is worth additional consideration 
during the model selection stage. 

6.4 Another procedure to detect multicollinearity is to 
calculate variance inflation factors (VIF). 

6.5 VIFi = 

1 1 -Rf 

where Rf is the coefficient of determination of the 
linear regression with X L being the response variable 
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and the remaining explanatory variables Xj where j =£ i, 
as the explanatory variables. 



G6.1: Correlation matrix of continuous explanatory variables 
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6.6 If Xi is nearly orthogonal to the remaining variables 
then Rf is small and VIF is close to 1. 



6.7 In general, VIFi > 5 indicates possible company, and another considers company as the sole 

multicollinearity problem and VIFi — 10 indicates that explanatory variable. 

multicollinearity is almost certainly a problem. 

6.8 VIF is computed with and without the explanatory 
variable of company, and tabulated in T6.1 below. 



T6.1: Variance inflation factors of continuous explanatory variables 


Explanatory V ariable 


VIF without company 


VIF with company 


W1 


1.2275 


93.7007 


Tm 


1.3288 


87.0183 


Ot 


1.6735 


44.3866 


Sp 


1.1019 


4.1908 


dO 


1.7133 


14.3445 


dl 


1.2076 


1.7100 


d2 


1.1929 


1.9343 



6.9 It is clear that the inclusion of the explanatory variable 
company in the model will result in a significant 
multicollinearity problem. 

6.10 The VIF analysis without the explanatory variable 
company did not indicate any multicollinearity issues, 
with the highest VIF being dO with 1.71 and the second 
highest being ot with 1.67, a result which corresponds 
well with the earlier findings when the Pearson 
correlation coefficients were examined. 

6.11 Hence two separate models are analysed in this paper, 
one considers all explanatory variables excluding 
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Section 7 



Poisson Model 



Model Specification 

7.1 Two separate models are considered under the Poisson 
model. The first model considers all the explanatory 
variables except company. The second model considers 
only the company explanatory variable, which will be 
presented separately in Section 12. As explained in 
Section 6, due to data reasons, combining these two 
models would be inappropriate. 



7.2 The saturated model, using the log link function, is 
expressed as: 



log(Zapse) = /?o + /3 2 wl + p 3 tm + foot + /3 5 sp + /? 6 d0 
+ /3jd\ + Pgd2 + I faiyean 

i 

+ log (exposure) 



7.3 Following from the saturated model, the null model is 
defined such that all |3i except for the intercept term |3o 
are set to zero. 



7.4 Models consisting of only one explanatory variable are 
defined such that all the |3i not associated with the 
explanatory variable are set to zero, leaving only the |3i 
associated with the explanatory variable and the 
intercept term |3od 

7.5 The log of the exposure term is set as an offset term, i.e. 
an explanatory variable of lapse where the coefficient is 
fixed to be 1. 

7.6 Thus the model would express the lapse rate i.e. ratio 
of lapse over exposure as an exponential term of the 
linear combination of (3, as follows: 

lapse 

exposure 

= e Po e P 2 wl e /3 3 tm e /3 4 ot e p s sv e p 6 d0 e P 7 dl e P 8 d2 e Y I iPg i year i 

Results and Interpretation 

7.7 The results of the models are tabulated in T7.1 in the 
next page. 



1 To keep the prototype simple, interactions terms between explanatory 
variables, and transformations of the explanatory variables are not 
considered. 
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77 . 1 : Results of the Poisson null, one explanatory variable only and saturated models 



Explanatory 

Variables 


Model 


Intercept 

Value 


Intercept 
Std. Err 


Intercept 
P(> 1 z 1 ) 


Coefficient 

Value 


Coefficient 
Std. Err 


Coefficient 
P(> 1 z 1 ) 


Residual 

Deviance 


Deg. of 
Freedom 


P(>X) 


AIC 


Null 


modpn 


-3.1142 


0.0007 


<2e-16 


NA 


NA 


NA 


370830 


74 


NA 


371710 


W1 


modp2 


-2.9623 


0.0014 


<2e-16 


-0.3962 


0.0033 


<2e-16 


356276 


73 


<2e-16 


357159 


Tm 


modp3 


-3.0367 


0.0011 


<2e-16 


-0.2740 


0.0030 


<2e-16 


362341 


73 


<2e-16 


363223 


Ot 


modp4 


-3.0249 


0.0011 


<2e-16 


-1.2730 


0.0125 


<2e-16 


359364 


73 


<2e-16 


360246 


Sp 


modp5 


-3.1356 


0.0009 


<2e-16 


0.0970 


0.0025 


<2e-16 


369329 


73 


<2e-16 


370211 


dO 


mo dp 6 


-3.2661 


0.0012 


<2e-16 


1.3888 


0.0081 


<2e-16 


347924 


73 


<2e-16 


348806 


dl 


mo dp 7 


-3.1730 


0.0010 


<2e-16 


0.5074 


0.0064 


<2e-16 


365142 


73 


<2e-16 


366024 


d2 


modp8 


-3.1765 


0.0010 


<2e-16 


0.5349 


0.0062 


<2e-16 


363989 


73 


<2e-16 


364871 


year 


modp9 


-3.0488 


0.0015 


<2e-16 








363950 


70 


<2e-16 


364838 


1 










0.0052 


0.0022 


0.0149 










2 










-0.0881 


0.0022 


<2e-16 










3 










-0.1307 


0.0022 


<2e-16 










4 










-0.1157 


0.0022 


<2e-16 










saturated 


mo dps 


-2.7109 


0.0033 


<2e-16 








245520 


63 


<2e-16 


246422 


wl 










-0.6079 


0.0047 


<2e-16 










tm 










-0.8822 


0.0039 


<2e-16 










ot 










-2.2799 


0.0129 


<2e-16 










sp 










0.0875 


0.0027 


<2e-16 










dO 










2.5126 


0.0127 


<2e-16 










dl 










-0.0353 


0.0097 


0.0002 










d2 










0.2756 


0.0090 


<2e-16 










yearl 










-0.0487 


0.0023 


<2e-16 










year2 










-0.1282 


0.0023 


<2e-16 










year3 










-0.1436 


0.0023 


<2e-16 










year4 










-0.0915 


0.0022 


<2e-16 
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7.8 The following are some examples to aid the 
interpretation of the model results. 

7.8.1 Under the null model, (i 0 = -3.1142 and hence the null 
model's estimate of lapse rate is e^° = 4.44%, which is 
the weighted average lapse rate of the observations. 
This is the case due to the canonical link function and 
serves as a useful check at the start of model selection. 

7.8.2 In modp6 where the only explanatory variable 
considered is dO, the coefficient term is 1.3888. This can 
be interpreted as the lapse rate for policies in the first 
policy year is e^ 6 = 4.01 multiples higher than lapse 
rates for all other policy years. 

7.8.3 In modps where all the variables are considered, a 
hypothetical lapse rate can be estimated for an average 
company, in the year 2012, which has a product mix of 
50:25:25 in whole life, endowment and term policies, of 
which 10% of them are single premium and a duration 
mix of 20:20:20:40 for policies in the first, second, third 
and thereafter policy years. This is estimated to be 

e 3 o e P 2 O.SgP 3 O. 2 Sg p 5 0 .i e p 6 0 . 2 e p 7 0 . 2 e p 8 0.2 = 6 88 %. 

Model Selection 

7.9 For the Poisson model, the ratio of the intercept term or 
coefficient term to the standard error gives a z-value 



which corresponds to a standard normal distribution. 
The underlying null hypothesis is that the (3's are equal 
to zero. 

7.10 The results of the two-tailed z-test in T7.1 suggest that 
all the explanatory variables, whether assessed 
individually or in the full model, are statistically 
significant. This would in turn indicate that the 
saturated model is a good model candidate. 

7.11 The residual deviance represents the extent to which 
variation in lapse rates are not explained by the model. 
For example, the ratio of residual deviance to the null 
deviance under modp5 is 369329 + 370830 = 99.6% 
which indicates that the variable sp only explains 0.4% 
of the variation in lapse rates. 

7.12 The ratio of the residual deviance to the degrees of 
freedom follows a X 2 distribution. 

7.13 Nonetheless, due to the clear overdispersion, an 
appropriate Poisson model shall not be selected. 
Correspondingly the topic of model selection will be 
discussed in more detail for other models where 
overdispersion is addressed. Model selection methods 
using the X 2 , AIC and similar criteria will not be 
discussed for the Poisson model. 
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Overdispersion 

7.14 For a well-fitted model, the residual deviance should 
approximately equal to the residual degrees of 
freedom. Overdispersion is said to arise when the 
residual deviance exceeds the residual degrees of 
freedom, which in turn arise because the variance of 
the observations is higher than the variance implied by 
the model. 

7.15 From the results in T7.1, clear overdispersion is 
observed. In this prototype, overdispersion is clearly 
due to the combination of the use of summarised data 
and other potentially more useful and precise 
explanatory variables - such as distribution channels, 
target market and conservation strategy, are not 
examined due to the lack of data availability. 

7.16 There are several ways to deal with overdispersion. 
Naturally, refitting the model with individual data 
instead of summarized data, or with better explanatory 
variables would reduce overdispersion. 

7.17 As these are not immediately available options, 
another way of dealing with overdispersion in a 
Poisson model is to extend the model to a quasi- 
Poisson model. 



7.17.1 A quasi-Poisson distribution has mean equal to the 
underlying Poisson distribution but a multiplicative 
dispersion parameter estimated from the observations 
applied to the variance of the distribution. 

7.17.2 This would imply that the coefficients calculated under 
the underlying Poisson model would equal those 
calculated under its quasi equivalent, but the standard 
errors of the coefficients would be underestimated. 

7.17.3 Typically an overly complex model would be selected 
if overdispersion is ignored as the explanatory power 
of coefficients would be overestimated. 

7.17.4 It is useful to emphasize that the objective of using a 
quasi model is merely to extend the prototype to 
investigate this methodology. It is not an implicit 
judgment that the |3i's calculated by the Poisson model 
is considered appropriate and acceptable. 

7.17.5 Extending to a quasi model is merely a technical fix 
applied when we are constrained by model use and 
data limitation, and not a default solution. 

7.18 Another method of dealing with overdispersion in a 
count model is to use a negative binomial regression 
model. 
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7.18.1 The difference between the negative binomial model 
and the quasi-Poisson model is that the former's 
dispersion parameter is a quadratic function of the 
mean, whereas the latter's dispersion parameter is a 
linear function of the mean. 

7.18.2 The negative binomial model also has the advantage of 
having a likelihood function, and hence model 
selection can be performed using maximum likelihood 
techniques. 
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Section 8 

Quasi-Poisson Model 

Model Specification 

8.1 The link function used for the quasi-Poisson model is 
the same as the link function used for the Poisson 
model. 

8.2 The company only model under quasi-Poisson model 
is presented in Section 12. 

Results and Interpretation 

8.3 T8.1 in the next page tabulates the quasi-Poisson model 
results. 

8.4 The intercept and coefficient terms in the Poisson 
model and the quasi-Poisson models are the same, as 
the mean is unchanged and the dispersion parameter is 
only applied to the variance. 

8.5 It follows that the standard errors, after allowing for 
the dispersion parameter, is significantly higher in the 
quasi-Poisson model compared to the Poisson model. 



8.6 Having allowed for overdispersion, it is now sensible 
to present results in terms of the standard errors. As an 
example, under modqp6, within one standard error 
around the mean, the lapse rate for policies in the first 
policy year is between e^ 6 x ±se( ^ 6 ) = (2.27, 7.06) 
multiples higher than the lapse rate of policies in all 
other policy years. 

8.7 The adjustment to the variance meant that the ratio of 
the coefficients to the standard errors follows a t- 
distribution instead of a standard normal distribution. 

8.8 Likewise, the F-test instead of Chi-square test is 
appropriate for the ratio of the residual deviance to the 
degrees of freedom. 

Model selection 

8.9 For the models with one explanatory variable only, 
only under modqp6 we would reject the null 
hypothesis that (3 6 = 0 using the t-test and 5% 
confidence level. This contrasts with the Poisson model 
where all the null hypotheses of models with one 
explanatory variable only are rejected under the z-test. 

8.10 Similarly, under the full model, coefficients sp, dl, d2 
and year are no longer statistically significant at 5% 
confidence level. 
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T8.1: Results of the quasi-Poisson null, one explanatory variable only and saturated models 



Explanatory 

Variables 


Model 

Name 


Intercept 

Value 


Intercept 
Std. Err 


Intercept 
P(> 1 t 1 ) 


Coefficient 

Value 


Coefficient 
Std. Err 


Coefficient 
P(> 1 t 1 ) 


Residual 

Deviance 


Deg. of 
Freedom 


P(>F) 


Dispersion 


Null 


modqpn 


-3.1142 


0.0515 


<2e-16 


NA 


NA 


NA 


370830 


74 


NA 


5470 


W1 


modqp2 


-2.9623 


0.1020 


<2e-16 


-0.3962 


0.2369 


0.0988 


356276 


73 


0.0993 


5222 


Tm 


modqp3 


-3.0367 


0.0793 


<2e-16 


-0.2740 


0.2200 


0.2170 


362341 


73 


0.2144 


5413 


Ot 


modqp4 


-3.0249 


0.0795 


<2e-16 


-1.2730 


0.9053 


0.1640 


359364 


73 


0.1430 


5231 


Sp 


modqp5 


-3.1356 


0.0662 


<2e-16 


0.0970 


0.1846 


0.6010 


369329 


73 


0.6036 


5518 


dO 


modqp6 


-3.2661 


0.0804 


<2e-16 


1.3888 


0.5658 


0.0165 


347924 


73 


0.0331 


4853 


dl 


modqp7 


-3.1730 


0.0753 


<2e-16 


0.5074 


0.4689 


0.2830 


365142 


73 


0.3052 


5335 


d2 


modqp8 


-3.1765 


0.0742 


<2e-16 


0.5349 


0.4514 


0.2400 


363989 


73 


0.2629 


5375 


Year 


modqp9 


-3.0488 


0.1138 


<2e-16 








363950 


70 


0.8751 


5679 


1 










0.0052 


0.1621 


0.974 










2 










-0.0881 


0.1646 


0.594 










3 










-0.1307 


0.1653 


0.432 










4 










-0.1157 


0.1643 


0.484 










saturated 


modqps 


-2.7109 


0.2079 


<2e-16 








245520 


63 


0.0041 


3972 


wl 










-0.6079 


0.2990 


0.0463 










tm 










-0.8822 


0.2473 


0.0007 










ot 










-2.2799 


0.8125 


0.0067 










sp 










0.0875 


0.1727 


0.6143 










dO 










2.5126 


0.8021 


0.0026 










dl 










-0.0353 


0.6119 


0.9542 










d2 










0.2756 


0.5675 


0.6289 










yearl 










-0.0487 


0.1436 


0.7356 










year2 










-0.1282 


0.1448 


0.3793 










year3 










-0.1436 


0.1440 


0.3225 










year4 










-0.0915 


0.1400 


0.5159 
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8.11 Had we adopted the Poisson model without taking 
into account overdispersion, the predictive power of 
the explanatory variables included in our models 
would have been overstated. 

8.12 There are several ways to perform model selection. 

8.12.1 Here a stepwise backwards elimination algorithm 
using the F-test is used, where the starting incumbent 
candidate model is the saturated model. 

8.12.2 The partial F statistic for each explanatory variable is 
performed in presence of all other explanatory 
variables. This implies that the challenging candidate 
models are all the models with one less explanatory 
variable than the starting incumbent candidate model. 

8.12.3 The explanatory variable with the largest p-value, if 
the p-value is larger than the selected confidence level, 
is identified. 

8.12.4 Here the confidence level is selected to be 5%. The 
model without this identified explanatory variable 
shall be the successful challenging candidate model 
and thus replaces the incumbent candidate model. 

8.12.5 This process is repeated until the largest p-value is less 
than 5%, where no challenging candidate model shall 



replace the incumbent candidate model under this 
algorithm. 

8.13 The results are tabulated in T8.2 in the next page. 

8.14 The backwards elimination algorithm has yielded the 
following model. 

lapse 

exposure 

_ e -2.7469 e -0.6250vW e -0.8666tm e -2.3221Ot e 2.5742d0 



8.15 The results of the model can be summarized as a 
multiplicative table in T8.3. 

T8.3: Multiplicative table of lapse rates given by F-test stepwise 
backwards elimination algorithm under quasi-Poisson model 



Base 




Lapse 

Rate 


6.41% 



Product Type 
Whole Life Ifo 



Term 

Others 



Policy Duration 



Subsequent 

Years 



8.16 Whilst this model is statistically significant, it is critical 
to note that the coefficient value of dO is very high. It 
implies that lapse rates in the first policy year is e 2,5742 
= 1312% higher than lapse rates in other policy years. 





T8.2: Results of the F-test stepwise backwards elimination algorithm under quasi-Poisson model 



Iteration 


Explanatory Variables 


Residual Deviance 


Deg. of Freedom 


P(>F) Action 


1 


none 


245520 








wl 


261473 


1 


0.0047 




tm 


294659 


1 


0.0007 




ot 


275594 


1 


0.0072 




sp 


246531 


1 


0.6124 




dO 


278312 


1 


0.0005 




dl 


245533 


1 


0.9537 remove 




d2 


246434 


1 


0.6298 




year 


250843 


4 


0.8490 


2 


none 


245533 








wl 


261573 


1 


0.0450 




tm 


294705 


1 


0.0007 




ot 


276484 


1 


0.0060 




sp 


246535 


1 


0.6111 




dO 


279423 


1 


0.0042 




d2 


246881 


1 


0.5554 




year 


250858 


4 


0.8452 remove 


3 


none 


250858 








wl 


268286 


1 


0.0332 




tm 


302957 


1 


0.0004 




ot 


282919 


1 


0.0044 




sp 


251179 


1 


0.7688 remove 




dO 


284990 


1 


0.0033 




d2 


253555 


1 


0.3955 



24 




Iteration Explanatory Variables Residual Deviance Deg. of Freedom P(>F) Action 







4 




None 


251179 




















wl 


269525 


1 


0.0280 
















tm 


303196 


1 


0.0003 
















ot 


283359 


1 


0.0041 
















dO 


285887 


1 


0.0029 
















d2 


253959 


1 


0.3852 


remove 










5 




None 


253959 




















wl 


271847 


1 


0.0296 
















tm 


304112 


1 


0.0004 
















ot 


290429 


1 


0.0023 
















dO 


294118 


1 


0.0014 






























Explanatory 


Model 


Intercept 


Intercept 


Intercept 


Coefficient Coefficient 


Coefficient 


Residual 


Deg. of 


P(>F) 


Dispersion 


Variables 


Name 


Value 


Std. Err 


P(> 1 1 1 ) 


Value Std. Err 


P(> 1 1 1 ) 


Deviance 


Freedom 






Backwards 


modqpf 


-2.7469 


0.1816 


<2e-16 






253959 


70 


2.6e-5 


3700 



wl 


-0.6250 


0.2800 


0.0288 


tm 


-0.8666 


0.2313 


0.0004 


ot 


-2.3221 


0.7350 


0.0023 


dO 


2.5742 


0.7226 


0.0007 
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8.17 In Section 6 we demonstrated that the explanatory 
variables ot and dO have a high Pearson correlation 
coefficient, and this has manifested into an 
unsatisfactory model. For example, this implies a lapse 
rate of 84.1% for the first policy year of an endowment 
policy. 

8.18 One simple method to eliminate this unsatisfactory 
feature in the model is to consider dropping dO and ot. 
The results are summarized in T8.4 in the next page. 

8.18.1 Between the 3 models considered, the Adjl model is a 
weak candidate as the coefficient ot has a high p-value. 

8.18.2 The Adj3 model has acceptable p-values for both the t- 
test and the F-test, however, its functionality is lower 
than the Adj2 model which has an extra coefficient, 
although this extra coefficient does have a high p- 
value. 

8.18.3 Comparing the p-value of the F-tests between the 
models would be spurious. 

8.18.4 A judgment is taken to adopt the adj2 model as the 
preferred candidate of quasi-Poisson models, and the 
results are summarized in T8.5. 



T8.5: Multiplicative table of lapse rates under the quasi- 
Poisson preferred candidate model 



Product Ty 


:>e 


Whole Life 


0.38 


Endowment 
and Others 


1.00 


Term 


0.42 



Policy Duration 


First Year 


2.28 


Subsequent 

Years 


1.00 



Base 




Lapse 

Rate 


7.54% 



8.19 It is useful to note that another theoretical difference 
between the quasi-Poisson model and the Poisson 
model is that the former does not have a likelihood 
function, and hence model selection methods based on 
maximum likelihood such as Akaike Information 
Criteria (AIC) is not relevant for quasi models. 2 

Model diagnostics 

8.20 Having selected the preferred model and prior to 
presenting the final results, the preferred model shall 
be diagnosed to investigate the relationship between 
the observations and the model and to ensure that the 
underlying assumptions of the model are not violated. 

2 There are research papers which suggest the use quasi-likelihood functions 
to derive a quasi-AIC for model selection. This is not considered in this 
paper as such methodologies, whilst gaining traction, are not yet widely 
accepted in the mainstream. 
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T8.4: Results of the quasi-Poisson model adjusted for multicollinearity 



Explanatory 

Variables 


Model 

Name 


Intercept 

Value 


Intercept 
Std. Err 


Intercept 
P(> 1 t 1 ) 


Coefficient 

Value 


Coefficient 
Std. Err 


Coefficient 
P(> 1 t 1 ) 


Residual 

Deviance 


Deg. of 
Freedom 


P(>F) 


Dispersion 


Drop dO 


Adjl 


-2.4144 


0.1641 


<2e-16 








294118 


71 


0.0015 


4472 


wl 










-0.9900 


0.2892 


0.0010 










tm 










-0.9357 


0.2541 


0.0004 










ot 










-0.7734 


0.7702 


0.3187 










Drop ot 


Adj2 


-2.5850 


0.1884 


<2e-16 








290429 


71 


0.0007 


4212 


wl 










-0.9707 


0.2773 


0.0008 










tm 










-0.8702 


0.2500 


0.0009 










dO 










0.8259 


0.5336 


0.1261 










Drop both 


Adj3 


-2.4418 


0.1631 


<2e-16 








299293 


72 


0.0008 


4499 


wl 










-1.0693 


0.2747 


0.0002 










tm 










-0.9240 


0.2548 


0.0005 
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8.21 



There are various model diagnostic tests which could 
be performed. The studentised deviance residuals, hat 
diagonals. Cook's distance, COVRATIO, DFFITS and 
DFBETA, commonly used diagnostics for GLM models 
are examined. 

8.22 Diagnostic tests, accompanied by generally accepted 
rule of thumbs, indicate observations or results of the 
model where further investigations are required. 



G8.1: Scatterplot of studentised deviance residuals and fitted 
values 




fitted values 



8.23 G8.1 shows the studentised deviance residuals are 

roughly evenly distributed around the horizontal 



central dotted line of zero, and there are no specific 
patterns suggested by the local regression curve. 

8.24 Studentised deviance residuals in excess of 3, as 
illustrated by the upper and lower horizontal dotted 
lines in G8.1, are generally considered as outliers. This 
outlier at the bottom of the graph is observation 19. 

G8.2: Normal QQ plot of sample quantiles and theoretical 
quantiles of studentised deviance residuals 
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theoretical quantiles 



8.25 G8.2 shows that the studentised deviance residuals are 

approximately normally distributed, as shown by the 
proximity of the dots to the solid line. As with 
empirical data, deviation from the normal distribution 
for tail values of studentised deviance residuals are 




common. The two outliers at the lower end of the tail 
are observations 19 and 33, and the outlier at the 
higher end of the tail is observation 10. 



8.26 The hat diagonal is a measure of the distance of an 
observation from others, in terms of the response 
variable. Hat diagonal values larger than 2 x 

(number of observations- residual degrees of freedom) _ ^ 670/ 
number of observations 

as illustrated by the horizontal dotted line, are deemed 
to be highly influential. 




8.27 G8.3 shows the hat diagonals for the observation 19 in 

particular is highly influential. The other influential 
observations are 7, 10, 22, 27, 37, 52, and 67. 



8.28 The Cook's distance measure the impact of each 
observation on the regression coefficients and fitted 

4 

values. Values larger than : — = 5.33% 

0 number of observations 

are considered highly influential. 



G8.4: Cook's distance plot for ordered observations 




id 



8.29 As shown by the horizontal dotted line in G8.4, 
observations 19 and 33 exceeds this threshold, with 
observation 19 being extremely influential. 

8.30 The COVRATIO measures the impact of an 
observation on the variance and covariance of the 
coefficients. Values outside the range given by 1 + 3 x 





(number of observations- residual degrees of freedom) 
number of observations 

(0.84, 1.16), are considered highly influential. 



8.31 In G8.5 it is observed that observation 19 is highly 
influential, whilst there are also several other 
influential observations including 7, 18, 37, 52, 57 and 
72 which falls marginally outside the range as 
illustrated by the horizontal dotted lines. 




8.32 DFFITS measures the impact of an observation on the 
fitted values. Observations outside the range of +2 x 

(number of observations- residual degrees of freedom) _ 

■J number of observations 

(—0.46, 0.46) are considered to be highly influential. 



8.33 Despite observation 19 being established as an outlier, 
it remains within the acceptable range illustrated by 
the horizontal dotted line as shown in G8.6. 



G8.6: DFFITS plot for ordered observations 




id 



8.34 DFBETA is a measure of the impact of an observation 

on the estimate of the coefficient, including the 

intercept. A rule of thumb is to consider observations 

2 

outside the range of ± : — to be highly 

° number of observations ° J 

influential. There is a DFBETA statistic for each 
observation with respect to each coefficient. 






dfbeta wl 00 dfbeta intercept 




.8: DFBETA plot for ordered observations for wl 




0 15 30 46 60 75 



id 



G8.9: DFBETA plot for ordered observations for tm 




0 15 30 45 60 75 

id 



G8.10: DFBETA plot for ordered observations for dO 




Id 
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8.35 



G8.7, G8.8, G8.9 and G8.10 in the previous page show 
that observation 19 is very influential on the intercept, 
tm and dO coefficients but not on the wl coefficient. 

8.36 On overall, with the exception of observation 19 which 
is a clear outlier, and observation 33 which is a 
borderline outlier, the rest of the observations and the 
model are mostly within the acceptable range as 
suggested by the rule of thumbs. 

Model Lift 

8.37 Model lift is a relative measure of the predictive power 
amongst models. One way to assess model lift is to sort 
the fitted values into groups and compare the average 
lapse rate in each group against the average lapse rate 
of all the fitted values i.e. a relative lapse rate. 

8.38 The model lift plot in G8.ll shows the relative lapse 
rates for each decile, with decile 1 least likely to lapse, 
and decile 10 most likely to lapse. 

8.39 In an entirely random model, the relative lapse rate in 
each decile would be the average lapse rate of all the 
fitted values. 

8.40 In a good model the relative lapse rates would be low 
in the lower deciles and high in the higher deciles. 



8.41 An indication of the predictive power of the model can 
be measured by the relative lapse rate of the top 
deciles. For example, in G8.ll, a 10% sample of the 
population using the model would yield a lapse rate of 
43% higher than a random 10% sample. 




8.42 Typically these would be fitted values from the 
validation data set i.e. a randomized dataset used for 
model validation purpose. However, as data was not 
set aside for validation in this prototype, the 
observations used to fit the model were also used to 
assess lift. This is merely for illustration purposes and 
such an analysis by itself is not very meaningful. 




Section 9 

Negative Binomial Model 

Model Specification 

9.1 The link function used for the negative binomial model 
is the log link function, hence similar to the Poisson 
model. The main difference between the negative 
binomial regression model and the Poisson model is 
the dispersion parameter. 

9.2 The company only model will be presented in Section 
12 . 

Results and Interpretation 

9.3 The results of the negative binomial model are 
tabulated in T9.1 in the next page. 

9.4 The results shall be interpreted in a similar way to the 
results of the Poisson model as the link function is the 
same. Under the null model, the intercept value /? 0 = - 
2.7453 gives a lapse rate of e^° = 6.37% which is no 
longer the weighted average lapse rate under the 
Poisson model, as this is not a canonical link function 
for the negative binomial distribution. 



9.5 Due to the allowance for overdispersion, the p-values 
for the coefficients increased if compared to the results 
of the Poisson distribution. 

Model selection 

9.6 For the models with one explanatory variable only, the 
null hypothesis that (3^ = 0 would be rejected for 
modnb3, modnb4 and modnb6 under using the t-test 
and 5% confidence level. Modnb7 has a 6.23% p-value. 

9.7 Similarly, under the saturated model, the coefficients 
wl, sp, dl, d2 and year are no longer statistically 
significant at 5% confidence level when compared 
against the Poisson model in Section 7. 

9.8 The AIC is an information-theoretic procedure used to 
compare between models, where a general rule of 
thumb is that, all else being equal, the model with a 
lower AIC is better. The AIC is given by the formula 
below. 

AIC = 2 x p — 2 x \og(likelihood ) 

,where p is the number of parameters in the model. 
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T9.1: Results of the negative binomial null, one explanatory variable only and saturated models 



Explanatory 

Variables 


Model 

Name 


Intercept 

Value 


Intercept 
Std. Err 


Intercept 
P(> 1 z 1 ) 


Coefficient 

Value 


Coefficient 
Std. Err 


Coefficient 
P(> 1 z 1 ) 


Residual 

Deviance 


Deg. of 
Freedom 


P(>X) 


AIC 


Dispersion 


null 


modnbn 


-2.7453 


0.0662 


<2e-16 


NA 


NA 


NA 


79.094 


74 


NA 


1621.8 


3.0393 


wl 


modnb2 


-2.8312 


0.1322 


<2e-16 


0.2498 


0.3755 


0.506 


79.078 


73 


0.5638 


1623.4 


3.0515 


tm 


modnb3 


-2.4064 


0.0873 


<2e-16 


-1.3623 


0.2285 


2.5e-9 


78.014 


73 


4.6e-7 


1598.4 


4.1443 


ot 


modnb4 


-2.6576 


0.0716 


<2e-16 


-2.0861 


0.6210 


0.0008 


78.726 


73 


0.0047 


1615.8 


3.3483 


sp 


modnb5 


-2.6981 


0.0846 


<2e-16 


-0.2502 


0.2261 


0.2690 


79.032 


73 


0.2626 


1622.5 


3.0856 


dO 


modnb6 


-3.0618 


0.0929 


<2e-16 


2.1171 


0.5183 


4.4e-5 


78.561 


73 


0.0006 


1612.1 


3.5017 


dl 


modnb7 


-2.8903 


0.0965 


<2e-16 


0.9911 


0.5316 


0.0623 


79.006 


73 


0.1783 


1622.0 


3.1064 


d2 


modnb8 


-2.7965 


0.0959 


<2e-16 


0.3153 


0.5242 


0.5480 


79.083 


73 


0.6364 


1623.6 


3.0475 


year 


modnb9 


-2.6963 


0.1461 


<2e-16 








78.985 


70 


0.6921 


1627.5 


3.1224 


1 










0.0682 


0.2067 


0.741 












2 










-0.0301 


0.2067 


0.884 












3 










-0.1558 


0.2067 


0.451 












4 










-0.1966 


0.2067 


0.341 












saturated 


modnbs 


-2.4589 


0.2042 


<2e-16 








77.152 


63 


1.9e-7 


1590.9 


5.8395 


wl 










-0.0919 


0.3008 


0.7599 












tm 










-1.2438 


0.2220 


2.10e-8 












ot 










-3.4574 


0.6082 


1.31e-8 












sp 










0.1118 


0.1726 


0.5171 












dO 










1.7782 


0.5255 


0.0007 












dl 










-0.4115 


0.4262 


0.3343 












d2 










0.1450 


0.4137 


0.7169 












yearl 










0.0763 


0.1544 


0.6210 












year2 










-0.0071 


0.1543 


0.9634 












year3 










-0.1309 


0.1559 


0.4013 












year4 










-0.1025 


0.1540 


0.5058 
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9.9 Broadly speaking the AIC model selection process 
suggests that a better model is one where the log 
likelihood shall increase by more than the incremental 
number of parameters. 

9.10 As shown in T9.1, all the one explanatory variable 
models have a lower AIC than the null model, and 
hence are better models. The saturated model has a 
lower AIC than all the one explanatory variable models 
and hence the saturated model would be the better 
model according to the AIC criteria. 

9.11 It is also important to note that comparisons using AIC 
would be less appropriate when there is 
overdispersion, as the assumptions underlying the 
likelihood function are challenged. Hence under the 
Poisson model, model selection based on AIC was not 
performed. 

9.12 It is useful to note that other information criteria, such 
as the Bayesian Information Criteria (BIC) can also be 
applied to model selection in GLM, and the AIC is 
merely one approach. 

9.13 One way to perform model selection is by a stepwise 
backwards process using AIC as the criteria. 



9.13.1 The starting incumbent candidate model is the 
saturated model. 

9.13.2 The AIC is calculated for each challenging candidate 
model, where challenging candidate models have one 
less explanatory variable than the incumbent candidate 
model. 

9.13.3 If any challenging candidate model has a lower AIC 
than the incumbent candidate model, the challenging 
candidate model with the lowest AIC replaces the 
incumbent candidate model. 

9.13.4 This process is repeated until no challenging candidate 
model can replace the incumbent candidate model 
under this algorithm. The results are shown in T9. 2 in 
the next page. 

9.14 The stepwise backwards AIC algorithm has yielded the 
following model. 

^ a V Se _ e -2.5630 e -1.1534tm e — 3.4043ot e 1.8449d0 

exposure 

9.15 The results of the model yielded by the stepwise 
backwards AIC algorithm are summarized in T9.3. 
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T9. 2: Results of stepwise backwards AIC algorithm under negative binomial model 



Iteration 


Explanatory Variables 


AIC 


action 


1 


none 


1588.9 






wl 


1587.0 






tm 


1609.3 






ot 


1606.2 






sp 


1587.4 






dO 


1597.7 






dl 


1587.3 






d2 


1587.0 






year 


1583.3 


remove 


2 


none 


1583.3 






wl 


1581.3 






tm 


1602.3 






ot 


1601.1 






sp 


1581.7 






dO 


1593.3 






dl 


1581.5 






d2 


1581.3 


Remove 


3 


none 


1581.3 






wl 


1579.3 


remove 




tm 


1600.3 






ot 


1599.1 






sp 


1579.7 






dO 


1591.3 






dl 


1579.5 




4 


none 


1579.3 






tm 


1598.8 






ot 


1598.1 






sp 


1577.8 






dO 


1589.1 






dl 


1577.6 


remove 
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Iteration 


Explanatory Variables 


AIC action 


5 


none 


1577.6 




tm 


1596.9 




ot 


1596.1 




sp 


1576.0 remove 




dO 


1587.5 


6 


none 


1576.0 




tm 


1595.0 




ot 


1594.2 




dO 


1585.7 



Explanatory 

Variables 


Model 

Name 


Intercept 

Value 


Intercept 
Std. Err 


Intercept 
P(> 1 z 1 ) 


Coefficient 

Value 


Coefficient 
Std. Err 


Coefficient 
P(> 1 z 1 ) 


Residual 

Deviance 


Deg. of 
Freedom 


P(>X) 


AIC 


Dispersion 


Backwards 


modqpf 


-2.5630 


0.1039 


<2e-16 








77.237 


71 


8.7e-ll 


1578.0 


5.6202 


tm 










-1.1534 


0.2027 


1.25e-8 












ot 










-3.4043 


0.5937 


9.81e-9 












dO 










1.8449 


0.5183 


0.0004 
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T9.3: Multiplicative table of lapse rate given by stepwise 
backwards AIC algorithm under negative binomial model 



X 

9.16 Similar to the quasi-Poisson model, the coefficient 
value of dO is rather high, implying that the lapse rate 
in the first policy year is 633% higher than in 
subsequent policy years. Again, dropping the 
explanatory variables dO and ot is considered. The 
results are summarized in T9.4 in the next page. 

9.16.1 Between the 3 models considered, both Adjl and Adj2 
models are strong candidates as the p-values under the 
z-tests of the coefficients and Chi-square tests are low. 

9.16.2 A comparison of the p-values between Adjl and Adj2 
would be spurious. Adjl does have a lower AIC than 
Adj2. 

9.16.3 Adj2 is selected as the preferred candidate model for 
the negative binomial model, as it is functionally better 
than Adjl due to the policy duration coefficient. 



Policy Duration 



First Year 6.33 

Subsequent ^ 
Years 




9.16.4 Modnb3 is a weak candidate as dropping the variables 
individually already sufficiently provides two strong 
candidate models, and dropping both explanatory 
variables would significantly reduce the functionality 
of the model without improving on statistical 
significance. The AIC of modnb3 is higher than Adjl 
and Adj2. 

9.17 The results of the preferred candidate model under the 
negative binomial distribution are summarized in T9. 5 
below. 

T9.5: Multiplicative table of preferred negative binomial 

candidate model 

Product Type 

Base Whole Life, 

Lapse 7.20% X Endowment 

Rate and Others 

Term 

Model diagnostics 

9.18 Similar to the quasi-Poisson model, model diagnostics 
are carried out for the preferred candidate negative 
binomial model. 






T9.4: Results of the negative binomial model adjusted for multicollinearity 



Explanatory 

Variables 


Model 

Name 


Intercept 

Value 


Intercept 
Std. Err 


Intercept 
P(> 1 z 1 ) 


Coefficient 

Value 


Coefficient 
Std. Err 


Coefficient 
P(> 1 z 1 ) 


Residual 

Deviance 


Deg. of 
Freedom 


P(>X) 


AIC 


Dispersion 


Drop dO 
tm 
ot 


Adjl 


-2.2816 


0.0889 


<2e-16 


-1.4255 

-2.3039 


0.2135 

0.5215 


2.42e-ll 

9.95e-6 


77.583 


72 


5.4e-9 


1587.7 


4.8495 


Drop ot 
tm 
dO 


Adj2 


-2.6313 


0.1175 


<2e-16 


-1.1667 

1.2054 


0.2300 

0.4795 


3.94e-7 

0.0119 


77.861 


72 


3.7e-7 


1596.2 


4.3653 


Drop both 
tm 


modnb3 


-2.4064 


0.0873 


<2e-16 


-1.3623 


0.2285 


2.51e-9 


78.014 


73 


4.6e-7 


1598.4 


4.1443 
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9.19 



It is useful to note that the preferred candidate model 
under the negative binomial distribution has one 
parameter less (and hence one residual degrees of 
freedom more) than the quasi-Poisson model. The 
value of the thresholds implied by the rule of thumbs 
are different from those calculated under the quasi- 
Poisson model. 

9.20 G9.1 shows the studentised deviance residuals are 
roughly evenly distributed around the horizontal 
central dotted line of zero, with observation 19 being 
the clear outlier. 

9.21 The clusters towards the right hand side in G9.1 
exaggerates the downwards pattern of the local 
regression curve. These are observations from 
company 7. 

9.22 G9.2 shows that the studentised deviance residuals are 
approximately normally distributed, as shown by the 
proximity of the dots to the solid line. The two outliers 
at the lower end of the tail are observations 19 and 34. 
The outlier at the higher end of the tail is observation 5. 



G9.1: Scatterplot of studentised deviance residuals and fitted 
values 




7 8 9 10 11 12 

fitted values 



G9.2: Normal QQ plot of sample quantiles and theoretical 
quantiles of studentised deviance residuals 




-2 -1 0 

theoretical quantiles 



2 








9.23 



G9.3 shows that in addition to observation 19 which 
was observed to have a high hat diagonal under the 
quasi-Poisson distribution, observation 65 is also 
indicated to be highly influential, with the other 
observations above the dotted line being 34, 49 and 64. 

9.24 The threshold for the Cook's distance, as illustrated by 
the dotted line in G9.4, is also breached by 
observations 5 and 34. 
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G9.4: Cook's distance plot for ordered observations 




G9.5: COVRATIO plot for ordered observations 






dffits 



9.25 



Observations 19, 33 and 65 are indicated to be 
significantly influential in G9.5 in the previous page, 
with observations 3, 18 and 64 very close to the 
COVRATIO threshold indicated by the horizontal 
dotted line. 

9.26 The DFFITS in G9.6 does not indicate any significant 
influential observations. 

9.27 In G9.7, G9.8, and G9.9 in the next page, the DFBETA 
for observation 19 indicates this observation is highly 
influential on the intercept and the dO coefficient but 
not the tm coefficient. 




G9.7: DFBETA plot for ordered observations for intercept 




G9.8: DFBETA plot for ordered observations for tm 




0 15 30 46 60 75 

id 
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G9.9: DFBETA plot for ordered observations for dO 




0 15 30 46 60 75 

id 



Model Lift 

9.28 In G9.10, the top decile has a relative lapse rate of 2.27. 

9.29 Compared to the quasi-Poisson model, the negative 
binomial performs significantly better in the top and 
bottom decile, but not so much in the middle deciles. 






Section 10 



Binomial Model 



Model Specification 

10.1 Similar to the Poisson model, two separate models are 
considered, and the company only model is discussed 
in Section 12. 

10.2 A binomial distribution with parameters n and p has a 
mean of np and a variance of npq. For small 
probabilities of success p, the mean would be 
approximately equal to the variance, and hence the 
Poisson model would be a close approximation of the 
binomial model. The expectation is that the results of 
the binomial model will be very close to those of the 
Poisson model. 

10.3 The saturated binomial model is expressed as: 



l°g( 



lapse 



;) 



exposure — lapse ) 

= /? 0 + /3 2 wl + /3 3 tm + /3 4 ot + /3 5 sp 



+ 



(3 6 d0 + p 7 dl + (3 e d2 + ^ /3 9i yeari 



10.4 Thus the model would express the lapse rate i.e. ratio 
of lapse over exposure as an exponential term as 
follows: 

lapse 

exposure 

1 

~ I + e -(/?o+/?2wi+/?3tm+/? 4 ot+/? 5 sp+/? 6 dO+/? 7 dl+/? 8 d2+2i/?9iyeari) 

10.5 The null model is considered in the same way as set 
out for the Poisson model. The models with only one 
explanatory variable are not examined as it will be 
similar to the results of the Poisson model. 

Results and Interpretation 

10.6 The results of the binomial model are tabulated in 
T10.1 in the next page. 
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T10.1 Results of the binomial null and saturated models 



Explanatory 

Variables 


Model 

Name 


Intercept 

Value 


Intercept 
Std. Err 


Intercept 
P(> 1 z 1 ) 


Coefficient 

Value 


Coefficient 
Std. Err 


Coefficient 
P(> 1 z 1 ) 


Residual 

Deviance 


Deg. of 
Freedom 


P(>X) 


AIC 


Null 


modbn 


-3.0688 


0.0007 


<2e-16 


NA 


NA 


NA 


389867 


74 


NA 


390742 


saturated 


modbs 


-2.6462 


0.0034 


<2e-16 








257525 


63 


<2e-16 


258422 



wl 


-0.6326 


0.0049 


<2e-16 


tm 


-0.9308 


0.0040 


<2e-16 


ot 


-2.4497 


0.0136 


<2e-16 


sp 


0.0904 


0.0028 


<2e-16 


dO 


2.7148 


0.0136 


<2e-16 


dl 


-0.0380 


0.0100 


0.0001 


d2 


0.2858 


0.0092 


<2e-16 


yearl 


-0.0553 


0.0023 


<2e-16 


year2 


-0.1374 


0.0024 


<2e-16 


year3 


-0.1524 


0.0023 


<2e-16 


year4 


-0.0971 


0.0023 


<2e-16 
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10.7 Under the null model, (1 0 = -3.0688, hence the null 
model's lapse rate estimate of 1+e _ (/?o) = 4.44% is the 
weighted average of the observations due to the use of 
the canonical link function. 

10.8 In modbs where all the variables are considered, 
similar to the Poisson model, a hypothetical lapse rate 
can be estimated for a company, in the year 2012, 
which has a product mix of 50:25:25 in whole life, 
endowment and term policies, of which 10% of them 
are single premium and a duration mix of 20:20:20:40 
for policies in the first, second, third and thereafter 
policy years. The lapse rate shall be estimated as 

ICVpSG 1 . Q[-(y 

exposure ~ i +e -(/?o+ /? 2 °-5+/? 3 °-2 + /? s o.i+/? 6 o. 2 +/? 7 o. 2 +/? 8 o. 2 ) /o - 

Overdispersion 

10.9 The binomial model exhibits clear overdispersion as 
the residual deviance is significantly larger than the 
degrees of freedom. 

10.10 Similar to the quasi-Poisson distribution, a quasi- 
binomial distribution has mean equal to the underlying 
binomial distribution but a multiplicative dispersion 
parameter estimated from the observations applied to 
the variance. 

10.11 The quasi-binomial model will have the same 
coefficients as the binomial model, but different 
standard errors. 
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Section 11 

Quasi-binomial Model 

Model Specification 

11.1 The link function used for the quasi-binomial model is 
the same as the link function for the binomial model. 

11.2 Similar to the binomial model, the company only 
model will be investigated in Section 12. 

Results and Interpretation 

11.3 The results of the quasi-binomial model are tabulated 
in Til. 1 in the next page. 



saturated model. The results are tabulated in Til. 2 in 
the subsequent pages. 

11.6 The F-test stepwise backwards elimination algorithm 
has yielded the following model. 

lapse 

exposure 

1 

_ -|- g(2.6852 + 0.6529wi + 0.9164tm+2.4856ot— 2.7718d0) 

11.7 Similar to the quasi-Poisson and negative binomial 
models, the coefficient value of dO is very high in the 
presence of the explanatory variable ot. It implies that 
lapse rates for endowment policies in the first policy 
year is 1+e ( 2 .6852 - 2 . 771 s) — 52.2 /o. 



Model Selection 

11.4 Similar to the quasi-Poisson model, allowing for 
overdispersion under the quasi-binomial model led to 
the coefficient of sp, dl, d2 and year no longer being 
statistically significant as compared to the binomial 
model. 

11.5 Similar to the quasi-Poisson model, a backwards 
elimination algorithm using the F-test is performed, 
where the starting incumbent candidate model is the 



11.8 One simple method to eliminate this unsatisfactory 
feature in the model is to consider dropping dO and ot. 
The results are summarized in the Til. 3 in the 
subsequent pages. 

11.9 The results are similar to the results for the quasi- 
Poisson model, and for comparison sake, Adj2 is 
selected as the preferred candidate model for the quasi- 
binomial model. 
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Tll.l Results of the quasi-binomial null and saturated models 



Explanatory 

Variables 


Model 

Name 


Intercept 

Value 


Intercept 
Std. Err 


Intercept 
P(> 1 1 1 ) 


Coefficient 

Value 


Coefficient 
Std. Err 


Coefficient 
P(> 1 1 1 ) 


Residual 

Deviance 


Deg. of 
Freedom 


P(>F) 


Dispersion 


Null 


modqbn 


-3.0688 


0.0539 


<2e-16 


NA 


NA 


NA 


389867 


74 


NA 


5724.6 


Saturated 


modqbs 


-2.6462 


0.2200 


<2e-16 








257525 


63 


<2e-16 


4163.8 



wl 


-0.6326 


0.3180 


0.0051 


tm 


-0.9308 


0.2609 


0.0007 


of 


-2.4497 


0.8748 


0.0068 


sp 


0.0904 


0.1814 


0.6200 


dO 


2.7148 


0.8789 


0.0030 


dl 


-0.0380 


0.6431 


0.9531 


d2 


0.2858 


0.5951 


0.6328 


yearl 


-0.0553 


0.1515 


0.7163 


year2 


-0.1374 


0.1523 


0.3704 


year3 


-0.1524 


0.1513 


0.3175 


year4 


-0.0971 


0.1470 


0.5113 
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Til. 2 Results of the F-test stepwise backwards elimination algorithm under quasi-binomial model 



Iteration 


Explanatory Variables 


Residual Deviance 


Deg. of Freedom 


P(>F) action 


1 


none 


257525 








wl 


273536 


1 


0.0522 




tm 


309188 


1 


0.0007 




ot 


289417 


1 


0.0069 




sp 


258551 


1 


0.6181 




dO 


292475 


1 


0.0005 




dl 


257539 


1 


0.9525 remove 




d2 


258462 


1 


0.6337 




year 


263172 


4 


0.8463 


2 


none 


257539 








wl 


273672 


1 


0.0450 




tm 


309254 


1 


0.0007 




ot 


290367 


1 


0.0058 




sp 


258556 


1 


0.6169 




dO 


293680 


1 


0.0039 




d2 


258888 


1 


0.5647 




year 


263185 


4 


0.8453 remove 


3 


none 


263185 








wl 


281045 


1 


0.0353 




tm 


318162 


1 


0.0003 




ot 


297078 


1 


0.0042 




sp 


263500 


1 


0.7763 remove 




dO 


299515 


1 


0.0031 




d2 


265971 


1 


0.3992 


4 


none 


263500 








wl 


282261 


1 


0.0300 




tm 


318425 


1 


0.0003 




ot 


297518 


1 


0.0039 




dO 


300422 


1 


0.0027 




d2 


266369 


1 


0.3891 remove 
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Iteration Explanatory Variables 


Residual Deviance 


Deg. of Freedom 


P(>F) action 


5 none 


266369 






wl 


284549 


1 


0.0322 


tm 


319335 


1 


0.0004 


ot 


304842 


1 


0.0022 


dO 


309084 


1 


0.0013 



Explanatory Model Intercept Intercept Intercept Coefficient Coefficient Coefficient Residual Deg. of P(>F) Dispersion 

Variables Name Value Std. Err P(> 1 1 1 ) Value Std. Err P(> 1 1 1 ) Deviance Freedom 
Backwards modqbf -2.6852 0.1933 <2e-16 266369 70 2.4e-5 3877 



wl 


-0.6529 


0.2971 


0.0313 


tm 


-0.9164 


0.2441 


0.0004 


ot 


-2.4856 


0.7879 


0.0024 


dO 


2.7718 


0.7867 


0.0008 
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Til. 3: Results of the quasi-binomial model adjusted for multicollinearity 



Explanatory 

Variables 


Model 

Name 


Intercept 

Value 


Intercept 
Std. Err 


Intercept 
P(> 1 t 1 ) 


Coefficient 

Value 


Coefficient 
Std. Err 


Coefficient 
P(> 1 1 1 ) 


Residual 

Deviance 


Deg. of 
Freedom 


P(>F) 


Dispersion 


Drop dO 


Adjl 


-2.3282 


0.1741 


<2e-16 








309084 


71 


0.0014 


4680 


wl 










-1.0470 


0.3049 


0.0010 










tm 










-0.9887 


0.2680 


0.0004 










ot 










-0.8140 


0.8047 


0.3152 










Drop ot 


Adj2 


-2.5128 


0.1998 


<2e-16 








304842 


71 


0.0007 


4414 


wl 










-1.0248 


0.2928 


0.0008 










tm 










-0.9197 


0.2636 


0.0008 










dO 










0.9095 


0.5809 


0.1219 










Drop both 


Adj3 


-2.3573 


0.1730 


<2e-16 








314538 


72 


0.0007 


4709 


wl 










-1.1301 


0.2902 


0.0002 










tm 










-0.9761 


0.2687 


0.0005 
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Model diagnostics 



11.10 Gll.l shows the studentised deviance residuals are 
evenly distributed with the exception of the outlier in 
the bottom right which is observation 19. 

11.11 In Gil. 2 the studentised deviance residuals are 
approximately normally distributed, the outlier in the 
bottom left is observation 19. 

11.12 G11.3 shows that observation 19 is highly influential. 
Other observations with hat diagonals exceeding the 
threshold are 7, 12, 19, 22, 27, 37, 52 and 67. 



Gll.l: Scatterplot of studentised deviance residuals and fitted 
values 




0.03 0.06 0.09 0.12 

fitted values 



G11.2: Normal QQ plot of sample quantiles and theoretical 
quantiles of studentised deviance residuals 




- 2-1012 

theoretical quantiles 



G11.3: Hat diagonal plot for ordered observations 




0 15 30 46 60 75 



id 





G11.4: Cook's distance plot for ordered observations 




Gil. 5: COVRATIO plot for ordered observations 




Gil. 6: DFFITS plot for ordered observations 




id 



11.13 In G11.4, observations 19 and 33 have high Cook's 
distance, indicating high influence. 

11.14 G11.5 shows a very low COVRATIO for observation 
19. The other observations which COVRATIO suggest 
high influence are observations 12, 18, 37, 52, 57 and 72. 

11.15 Gil. 6 shows observation 19 with a low DFFITS but 
within the threshold. 

11.16 In Gil. 7, Gil. 8, G11.9, and Gil. 10 in the next page, the 
DFBETA indicates that observation 19 is highly 
influential on the intercept, tm and dO coefficients, but 
not on the wl coefficient. 






Gil .7: DFBETA plot for ordered observations for intercept 




id 



Gil. 8: DFBETA plot for ordered observations for wl 




id 



Gil. 9: DFBETA plot for ordered observations for tm 




id 



Gil. 10: DFBETA plot for ordered observations for dO 




0 15 30 45 60 75 



id 
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Model Lift 



11.17 In Gil. 11, the top decile has a relative lapse rate of 
1.43. This is expected as the results of the quasi- 
binomial model are very similar to the quasi-Poisson 
model. 
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Section 12 



Results and Interpretation 



Company only models 

Model Specification 

12.1 This section investigates models with company as the 
only explanatory variable. 

12.2 The Poisson, quasi-Poisson and negative binomial 
models, can be expressed as: 

log(Zapse) = /? 00 + fiucompanyi + log (exposure) 

lapse „ „ 

_ = e Poo gPucompanyi 

exposure 

12.3 The additional 0 in the term (3 00 is to distinguish the 
intercept term from that of the first model. 

12.4 The binomial and negative binomial models, can be 
expressed as follows: 

/ lapse \ 

log = p 00 + Pucompanyi 

\exposure — lapse ) 

lapse 1 

exposure 1 + e~^oo+ Pii com P an yi) 



12.5 The results of the Poisson model are not presented in 
detail. The residual deviance is 72132, and against 60 
degrees of freedom there is significant overdispersion. 

12.6 Similarly, the binomial model has residual deviance of 
75699. Against 60 degrees of freedom there is 
significant overdispersion. 

12.7 The results of the quasi-Poisson model are tabulated in 
T12.1. 

12.8 The exponential of the intercept term is the model's 
estimate of the lapse rate of the baseline company i.e. 
company 7. 

12.8.1 Here (3 00 = -3.3348 and hence the lapse rate estimated is 
eP°° = 3.56%. 

12.8.2 The lapse rate estimates of the other companies can be 
derived by multiplying e P liCompany ‘ to the lapse rate of 
the baseline company. For example, company l's lapse 
rate is eP°°eP llCompanyi = 2.56% with e P llCompanyi = 
-0.3321. 
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T12.1: Results of the quasi-Poisson null and company only model 



Explanatory 

Variables 


Model 

Name 


Intercept 

Value 


Intercept 
Std. Err 


Intercept 
P(> 1 t 1 ) 


Coefficient 

Value 


Coefficient 
Std. Err 


Coefficient 
P(> 1 t 1 ) 


Residual 

Deviance 


Deg. of 
Freedom 


P(>F) 


Dispersion 


null 


modqpn 


-3.1142 


0.0515 


<2e-16 


NA 


NA 


NA 


370830 


74 


NA 


5470 



company modqpl -3.3348 0.0536 <2e-16 72132 60 <2e-16 1183 



1 


-0.3321 


0.0978 


0.0012 


2 


0.7262 


0.1241 


2.17e-7 


3 


-0.1159 


0.1151 


0.3177 


4 


-0.7617 


0.2625 


0.0052 


5 


1.4323 


0.2819 


3.94e-6 


6 


0.1832 


0.1033 


0.0813 


8 


0.6458 


0.1180 


9.09e-7 


9 


0.2506 


0.0797 


0.0026 


10 


0.5749 


0.0969 


1.59e-7 


11 


0.7717 


0.1326 


2.45e-7 


12 


0.5738 


0.1203 


1.21e-5 


13 


0.5465 


0.1051 


2.52e-6 


14 


0.6511 


0.1369 


1.27e-5 


15 


1.2667 


0.1613 


8.80e-ll 
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12.8.3 

12.9 

12.10 
12.11 

12.12 

12.13 



It can be observed that the estimated lapse rates for 
each company are the same as the average values of 
the observations. This is because for the log link 
function is canonical link function for the Poisson 
distribution, and hence this can be observed for 
discrete explanatory variables. 

At 5% confidence level, the comparisons between the 
baseline company and companies 3 and 6 are not 
statistically significant. 

The p-value of the F-test suggests that the company 
only model is an improvement to the null model. 

The results of the negative binomial model are 
tabulated in T12.2 in the next page. 

The negative binomial model seems to be more 
penalizing or less precise as companies 1, 3, 6 and 9 
have p-values higher than 5%. As an example, the 
estimated lapse rate for company 3 is e^e^ 13 = 3.36%, 
which is 6% higher than 3.17%, 
the average values of the observations 

Similarly the company model is better than the null 
model as demonstrated by the lower AIC, and low p- 
value for the Chi-square test. 



12.14 The results of the quasi-binomial model are tabulated 
in T12.3. 

12.15 The overall results are very similar to the quasi-Poisson 
model, with the coefficients of company 3 and 
company 6 having a p-value higher than 5%. 

12.16 As the logit link function is the canonical link function 
for the binomial distribution, similar to the quasi- 
Poisson model, it can be observed that the estimated 
lapse rates which are the average of the fitted values 
are the same as the average values of the observations. 

12.17 The F-test shows that the quasi-binomial model is an 
improvement to the null model. 

12.18 These company only models merely complements the 
paper, the results of these company only models are 
not particularly insightful. The corresponding model 
diagnostics and model lifts are not considered in detail. 
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T12.2: Results of the negative binomial null and company only model 



Explanatory 

Variables 


Model 

Name 


Intercept 

Value 


Intercept 
Std. Err 


Intercept 
P(> 1 z 1 ) 


Coefficient 

Value 


Coefficient 
Std. Err 


Coefficient 
P(> 1 z 1 ) 


Residual 

Deviance 


Deg. of 
Freedom 


P(>X) 


AIC 


Dispersion 


Null 


modnbn 


-2.7453 


0.0662 


<2e-16 


NA 


NA 


NA 


79.094 


74 


NA 


1621.8 


3.0393 


company 


modnbl 


-3.3353 


0.1344 


<2e-16 








76.291 


60 


1.4e-15 


1547.1 


11.0791 



1 


-0.3319 


0.1900 


0.0807 


2 


0.7434 


0.1900 


9.16e-5 


3 


-0.0573 


0.1900 


0.7631 


4 


-0.4213 


0.1902 


0.0268 


5 


1.3957 


0.1900 


2.16e-13 


6 


0.1851 


0.1900 


0.3301 


8 


0.6418 


0.1900 


0.0007 


9 


0.2520 


0.1900 


0.1849 


10 


0.5604 


0.1900 


0.0032 


11 


0.7728 


0.1900 


4.78e-5 


12 


0.5781 


0.1900 


0.0024 


13 


0.5446 


0.1900 


0.0042 


14 


0.6488 


0.1901 


0.0006 


15 


1.2911 


0.1901 


1.10e-ll 
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T12.3: Results of the quasi-binomial null and company only model 



Explanatory 

Variables 


Model 

Name 


Intercept 

Value 


Intercept 
Std. Err 


Intercept 
P(> 1 t 1 ) 


Coefficient 

Value 


Coefficient 
Std. Err 


Coefficient 
P(> 1 1 1 ) 


Residual 

Deviance 


Deg. of 
Freedom 


P(>F) 


Dispersion 


null 


modqbn 


-3.0688 


0.0539 


<2e-16 


NA 


NA 


NA 


389867 


74 


NA 


5724.6 


company 


modqbl 


-3.2986 


0.0560 


<2e-16 








75699 


60 


0.0039 


1242.7 



1 


-0.3425 


0.1017 


0.0013 


2 


0.7664 


0.1317 


2.44e-7 


3 


-0.1200 


0.1199 


0.3210 


4 


-0.7812 


0.2714 


0.0055 


5 


1.5576 


0.3126 


5.63e-6 


6 


0.1906 


0.1081 


0.0829 


8 


0.6799 


0.1248 


1.00e-6 


9 


0.2612 


0.0834 


0.0027 


10 


0.6040 


0.1022 


1.72e-7 


11 


0.8157 


0.1410 


2.79e-7 


12 


0.6028 


0.1270 


1.32e-5 


13 


0.5737 


0.1108 


2.73e-6 


14 


0.6856 


0.1449 


1.40e-5 


15 


1.3656 


0.1760 


1.26e-10 
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Appendix A 



Calibration of Raw Data 

Data is extracted from the Forms L4, L6 and L8 of the Annual Insurance Statistics of Bank Negara Malaysia for the years 2006 to 2012. 
The variables used for this paper is described in the table below. 

Variables Description 

Company The company name is mapped into numerical in our data, as follows: 



Company in data 


Company labels in 2012 Annual Insurance Statistics report) 


1 


AIAB 


2 


ALLIANZ LIFE 


3 


AMLIFE 


4 


AXALIFE 


5 


CIMB AVIVA 


6 


ETIQA 


7 


GREAT EASTERN 


8 


HONG LEONG 


9 


ING 


10 


ZURICH 


11 


MANULIFE 


12 


MCIS ZURICH 


13 


PRUDENTIAL 


14 


TOKIO MARINE LIFE 



15 UNI.ASIA LIFE 



Year 


The reporting year is read in R as follows: 

Year in data 2008 2009 2010 2011 2012 

Year in R 1 2 3 4 5 


Lapse 


The sum of forfeitures and surrenders less revivals in the reporting year. 


Exposure 


The average of the inforce policies at the end of the reporting year and the inforce policies at the end of the first year 
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preceding the reporting year. It is rounded up to the nearest integer as required by the default binomial model in R. 


W1 


Proportion of whole life policies inforce to total policies inforce at the end of the reporting year. 


Tm 


Proportion of term life policies inforce to total policies inforce at the end of the reporting year. 


Ot 


Proportion of others policies inforce to total policies inforce at the end of the reporting year. 


Sp 


Proportion of single premium written in the reporting year to total premiums written in the reporting year. 


DO 


Proportion of new policies written in the reporting year to inforce policies at the end of the reporting year. 


D1 


Proportion of new policies written in the first year preceding the reporting year to inforce policies at the end of the 
reporting year. 


D2 


Proportion of new policies written in the second year preceding the reporting year to inforce policies at the end of the 
reporting year. 



The following adjustments were made to the raw data: 

1. In the reporting year 2007, company 4 has two entries 
in the Annual Insurance Statistics due to corporate 
restructuring exercise. A judgment was exercised to 
select the data labeled "AXA" for inforce data and the 
sum of both entries for movement and new business 
data. 

2. In the reporting year 2009, company 11 has two entries 
in the Annual Insurance Statistics due to corporate 
restructuring exercise. A judgment was exercised to 
select the data labeled "MIB" instead of "MANULIFE" 

3. In the reporting year 2009, company 1 has two entries 
in the Annual Insurance Statistics due to corporate 
restructuring exercise. A judgment was exercised to 
select the data labeled "AIAB" for inforce data and the 



sum of both entries for movement and new business 
data. 

4. In the reporting year 2011, company 6 has 9690 number 
of rider policies inforce at the end of the reporting year. 
The normal convention in the data is that riders are not 
counted as policies, hence this has been deducted from 
the total number of inforce policies at the end of the 
reporting year. 

5. In the reporting year 2011, company 11 has two entries 
in the Annual Insurance Statistics due to corporate 
restructuring exercise. A judgment was exercised to 
select the data labeled "ETIQA" for inforce data and 
the sum of both entries for movement and new 
business data. 
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6. In the reporting year 2012, company 6's reporting year 
was only 6 months due to the company's change of 
financial reporting year. All data values for movement 
and new business in 2012 has been doubled, and 
inforce data was kept as 2011's value for the purpose of 
our analysis 

7. In the reporting year 2012, company 10's and 
company's 15 total number of policies inforce at the 
end of the reporting year does not tally to be the 
number of inforce policies split by whole life, 
endowment, term and others, at the end of the 
reporting year. The latter was used in this analysis. 

8. A subsequent data entry error was found for company 
11 in 2011. The number of policies inforce at the end of 
the reporting year was entered as 197059 when it 
should have been 206749. This error was not corrected 
as the difference is less than 5% and deemed to be 
material. 
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The data used for the analyses is tabulated below. 



Id 


company 


year 


lapse 


exposure 


wl 


tm 


ot 


sp 


dO 


dl 


d2 


1 


1 


2008 


27,489 


1,377,090 


0.4982 


0.3539 


0.0416 


0.1744 


0.0796 


0.0821 


0.0736 


2 


2 


2008 


20,932 


228,560 


0.4503 


0.2681 


0.0079 


0.2244 


0.1902 


0.2061 


0.2097 


3 


3 


2008 


26,255 


678,866 


0.0778 


0.8129 


0.0062 


0.5301 


0.1511 


0.1633 


0.1977 


4 


4 


2008 


3,042 


51,134 


0.3064 


0.4870 


- 


0.8305 


0.0157 


0.0631 


0.0608 


5 


5 


2008 


4,541 


22,116 


0.1760 


0.0524 


0.0782 


0.3041 


0.2261 


0.2528 


0.0423 


6 


6 


2008 


23,091 


714,149 


0.0538 


0.4587 


0.0007 


0.2644 


0.0704 


0.0588 


0.0850 


7 


7 


2008 


93,377 


2,323,661 


0.6627 


0.0140 


0.0792 


0.0875 


0.1035 


0.0639 


0.0579 


8 


8 


2008 


17,921 


272,218 


0.4452 


0.3458 


0.0128 


0.0225 


0.1560 


0.1047 


0.0929 


9 


9 


2008 


75,594 


1,378,713 


0.3477 


0.4323 


0.0728 


- 


0.1241 


0.1242 


0.1178 


10 


10 


2008 


52,626 


628,177 


0.3750 


0.3646 


0.1076 


0.8973 


0.1829 


0.2239 


0.2039 


11 


11 


2008 


17,022 


230,894 


0.3310 


0.5180 


0.0003 


0.1808 


0.0553 


0.0644 


0.0808 


12 


12 


2008 


16,601 


330,712 


0.0755 


0.0116 


0.0009 


0.0032 


0.0851 


0.0740 


0.0934 


13 


13 


2008 


26,465 


434,293 


0.1481 


0.4338 


0.0283 


0.0121 


0.0808 


0.0760 


0.0590 


14 


14 


2008 


12,905 


211,098 


0.3353 


0.0109 


0.0128 


0.1616 


0.0981 


0.0705 


0.1013 


15 


15 


2008 


8,122 


55,790 


0.2842 


0.0138 


- 


- 


0.2520 


0.1820 


0.1893 


16 


1 


2009 


35,357 


1,391,069 


0.4922 


0.3389 


0.0592 


0.5068 


0.0828 


0.0789 


0.0815 


17 


2 


2009 


23,184 


244,829 


0.4460 


0.2765 


0.0143 


0.2022 


0.1879 


0.1785 


0.1933 


18 


3 


2009 


27,719 


749,230 


0.0747 


0.8144 


0.0138 


0.5815 


0.1624 


0.1359 


0.1469 


19 


4 


2009 


2,283 


169,007 


0.0473 


0.0790 


0.8344 


0.4242 


0.8509 


0.0027 


0.0109 


20 


5 


2009 


3,457 


21,028 


0.2002 


0.0546 


0.0267 


0.0839 


0.1479 


0.2434 


0.2721 


21 


6 


2009 


29,894 


718,268 


0.0599 


0.4410 


0.0004 


0.8372 


0.0865 


0.0696 


0.0582 


22 


7 


2009 


79,839 


2,331,223 


0.6669 


0.0083 


0.0938 


0.0685 


0.0882 


0.1032 


0.0638 


23 


8 


2009 


19,013 


290,391 


0.4366 


0.3092 


0.0278 


0.0201 


0.1475 


0.1468 


0.0985 


24 


9 


2009 


55,154 


1,446,795 


0.3481 


0.4165 


0.1017 


- 


0.1129 


0.1189 


0.1190 


25 


10 


2009 


43,687 


607,815 


0.3809 


0.3416 


0.1254 


0.9351 


0.1142 


0.1944 


0.2380 
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Id 


company 


year 


lapse 


exposure 


wl 


tm 


ot 


sp 


dO 


dl 


d2 


26 


11 


2009 


17,246 


219,690 


0.3369 


0.5133 


0.0003 


0.1449 


0.0415 


0.0590 


0.0687 


27 


12 


2009 


16,147 


327,158 


0.0839 


0.0104 


0.0203 


0.0020 


0.0991 


0.0852 


0.0741 


28 


13 


2009 


25,269 


444,916 


0.1560 


0.4330 


0.0238 


0.0097 


0.1066 


0.0776 


0.0729 


29 


14 


2009 


13,126 


214,852 


0.3383 


0.0100 


0.0164 


0.1917 


0.0868 


0.0969 


0.0696 


30 


15 


2009 


9,852 


65,669 


0.2893 


0.0116 


- 


- 


0.3465 


0.2048 


0.1479 


31 


1 


2010 


40,063 


1,396,959 


0.4891 


0.3248 


0.0716 


0.1911 


0.0867 


0.0828 


0.0789 


32 


2 


2010 


22,178 


259,896 


0.4473 


0.2716 


0.0180 


0.3159 


0.1768 


0.1777 


0.1688 


33 


3 


2010 


9,479 


808,059 


0.0834 


0.8073 


0.0162 


0.5052 


0.1598 


0.1549 


0.1296 


34 


4 


2010 


1,322 


286,921 


0.0471 


0.9103 


- 


0.1963 


0.0275 


0.8579 


0.0027 


35 


5 


2010 


2,010 


19,258 


0.1955 


0.0504 


0.0185 


- 


0.0459 


0.1641 


0.2700 


36 


6 


2010 


31,808 


720,001 


0.0677 


0.4201 


0.0003 


0.9132 


0.0882 


0.0871 


0.0700 


37 


7 


2010 


84,259 


2,336,318 


0.6670 


0.0070 


0.1031 


0.0480 


0.0892 


0.0880 


0.1030 


38 


8 


2010 


20,558 


311,016 


0.4205 


0.2682 


0.0539 


0.0902 


0.1614 


0.1367 


0.1361 


39 


9 


2010 


61,955 


1,501,635 


0.3543 


0.4009 


0.1219 


- 


0.1098 


0.1094 


0.1151 


40 


10 


2010 


34,346 


570,282 


0.3901 


0.3273 


0.1254 


0.9016 


0.0989 


0.1221 


0.2078 


41 


11 


2010 


16,564 


204,016 


0.3522 


0.4956 


0.0003 


0.0179 


0.0381 


0.0451 


0.0642 


42 


12 


2010 


18,799 


326,126 


0.0979 


0.0094 


0.0362 


0.0019 


0.1120 


0.0996 


0.0856 


43 


13 


2010 


28,281 


465,542 


0.1560 


0.4409 


0.0176 


0.5617 


0.1174 


0.1015 


0.0738 


44 


14 


2010 


14,079 


217,432 


0.3454 


0.0089 


0.0162 


0.1670 


0.0928 


0.0858 


0.0958 


45 


15 


2010 


9,791 


82,169 


0.2724 


0.0116 


- 


0.6374 


0.3406 


0.2733 


0.1615 


46 


1 


2011 


35,544 


1,386,941 


0.4914 


0.3139 


0.0821 


0.2339 


0.0695 


0.0880 


0.0840 


47 


2 


2011 


16,842 


271,307 


0.4471 


0.2663 


0.0209 


0.1213 


0.1259 


0.1715 


0.1723 


48 


3 


2011 


22,796 


842,883 


0.0839 


0.8090 


0.0161 


0.3767 


0.1497 


0.1539 


0.1492 


49 


4 


2011 


3,884 


284,109 


0.0517 


0.8942 


0.0048 


0.0465 


0.0507 


0.0279 


0.8678 


50 


5 


2011 


993 


17,443 


0.2027 


0.0457 


0.0162 


- 


0.0126 


0.0505 


0.1802 
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Id 


company 


year 


lapse 


exposure 


wl 


tm 


ot 


sp 


dO 


dl 


d2 


51 


6 


2011 


31,108 


704,388 


0.0772 


0.3983 


0.0003 


0.5718 


0.0723 


0.0916 


0.0904 


52 


7 


2011 


80,007 


2,310,443 


0.6740 


0.0066 


0.1022 


0.0837 


0.0728 


0.0914 


0.0901 


53 


8 


2011 


22,128 


336,613 


0.4460 


0.2378 


0.0874 


0.1747 


0.1648 


0.1488 


0.1259 


54 


9 


2011 


65,026 


1,543,553 


0.3515 


0.3911 


0.1382 


- 


0.1017 


0.1073 


0.1069 


55 


10 


2011 


25,259 


541,071 


0.3904 


0.3176 


0.1390 


0.8915 


0.1005 


0.1027 


0.1268 


56 


11 


2011 


15,206 


196,287 


0.3382 


0.4519 


0.0602 


0.0034 


0.0840 


0.0378 


0.0448 


57 


12 


2011 


26,631 


320,667 


0.1006 


0.0090 


0.0451 


0.0010 


0.0948 


0.1153 


0.1026 


58 


13 


2011 


31,302 


489,910 


0.1514 


0.4443 


0.0125 


0.0044 


0.1275 


0.1114 


0.0962 


59 


14 


2011 


15,910 


221,352 


0.3465 


0.0077 


0.0160 


0.7006 


0.1142 


0.0907 


0.0838 


60 


15 


2011 


11,312 


97,282 


0.2874 


0.0256 


- 


0.0173 


0.2394 


0.3047 


0.2445 


61 


1 


2012 


38,505 


1,372,656 


0.4903 


0.2988 


0.1095 


0.0354 


0.0838 


0.0699 


0.0885 


62 


2 


2012 


11,326 


278,272 


0.4423 


0.2596 


0.0200 


0.0159 


0.0931 


0.1233 


0.1681 


63 


3 


2012 


27,934 


520,512 


0.3801 


0.1600 


0.0752 


0.2711 


0.1892 


0.7044 


0.7246 


64 


4 


2012 


7,385 


286,196 


0.0563 


0.8683 


0.0152 


0.0243 


0.0856 


0.0494 


0.0271 


65 


5 


2012 


4,442 


23,666 


0.6025 


0.0257 


0.0077 


0.0031 


0.6308 


0.0068 


0.0273 


66 


6 


2012 


35,880 


691,153 


0.0772 


0.3983 


0.0003 


0.2247 


0.0580 


0.0723 


0.0916 


67 


7 


2012 


74,204 


2,256,031 


0.6745 


0.0062 


0.0999 


0.1076 


0.0654 


0.0745 


0.0936 


68 


8 


2012 


27,527 


366,709 


0.4536 


0.2000 


0.1143 


0.0133 


0.1911 


0.1507 


0.1361 


69 


9 


2012 


83,191 


1,578,819 


0.3469 


0.3673 


0.1664 


- 


0.1172 


0.0995 


0.1049 


70 


10 


2012 


25,659 


521,398 


0.3923 


0.2849 


0.1635 


0.8604 


0.1076 


0.1042 


0.1065 


71 


11 


2012 


14,385 


192,678 


0.3424 


0.4309 


0.0794 


0.0045 


0.0631 


0.0879 


0.0396 


72 


12 


2012 


23,925 


310,288 


0.1034 


0.0086 


0.0451 


0.0006 


0.0823 


0.0983 


0.1196 


73 


13 


2012 


33,615 


521,061 


0.1417 


0.4473 


0.0095 


0.0095 


0.1435 


0.1189 


0.1038 


74 


14 


2012 


18,589 


227,477 


0.3446 


0.0067 


0.0237 


0.6071 


0.1296 


0.1108 


0.0879 


75 


15 


2012 


12,017 


103,255 


0.2771 


0.0204 


- 


0.0021 


0.1462 


0.2368 


0.3014 
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Appendix B 



R commands 

#### lapse GLM prototype 

#### get packages 
library(MASS) ' 
library(corrplot) 
library(ggplot2) 

#### get data 

setwd("C:/Users/nyeo/Desktop/GLM/Data") 
data = read.table("short.csv",header=T,sep=",") 
data$company = C(factor(data$company),base=7) 
data$year = C(factor(data$year),base=5) 
attach(data) 

#### exploratory analysis 

#histogram and density function of lapse rate 

ggplot(data,aes(x=lapse/exposure))+ 

geom_histogram(aes(y=.. density.. ),binwidth=0.025,col="white",fill="#D55E00")+ 
geom_density(col="#009E73",lwd=1.5)+ 

geom_vline(aes(xintercept=sum(lapse)/sum(exposure)),col="#009E73",lwd=1 .5,lty=2)+ 
xlab("lapse rates")+theme(panel. background = element_rect(fill = "#999999")) 

#log lapse and log exposure 
ggplot(data,aes(x=log(exposure),y=log(lapse)))+ 
geom_point(shape=16,size=4,col="#D55E00")+ 
geom_smooth(method=lm,color="#009E73",lwd=1 .5,se=F)+ 

theme(legend.position="none")+theme(panel. background = element_rect(fill = "#999999")) 
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#lapse rate and company 
ggplot(data,aes(x=company,y=lapse/exposure))+ 
geom_boxplot(col="black", fill="#D55E00")+ 
theme(legend.position="none")+ 

geom_hline(aes(yintercept=sum(lapse)/sum(exposure)),col="#009E73",lwd=1.5,lty=2)+ 
ylab("lapse rates")+theme(panel. background = element_rect(fill = "#999999")) 

#log lapse rate and company 
ggplot(data,aes(x=company,y=log(lapse/exposure)))+ 
geom_boxplot(col="black", fill="#D55E00")+ 
theme(legend.position="none")+ 

geom_hline(aes(yintercept=log(sum(lapse)/sum(exposure))),col="#009E73",lwd=1.5,lty=2)+ 
ylab("log lapse rates")+theme(panel. background = element_rect(fill = "#999999")) 

#log lapse rate and year 

ggplot(data,aes(x=year,y=log(lapse/exposure)))+ 
geom_boxplot(col="black", fill="#D55E00")+ 

geom_hline(aes(yintercept=log(sum(lapse)/sum(exposure))),col="#009E73",lwd=1.5,lty=2)+ 
ylab("log lapse rates")+theme(panel. background = element_rect(fill = "#999999")) 

#log lapse rate and product type - wl, tm, ot, sp 
ggplot(data,aes(x=wl,y=log(lapse/exposure)))+ 
geom_point(shape=16,col="#D55E00",size=4)+ 
geom_smooth(method=loess,se=F,col="#009E73",lwd=1 .5)+ 

geom_hline(aes(yintercept=log(sum(lapse)/sum(exposure))),col="#009E73",lwd=1 .5,lty=2)+ 
ylab("log lapse rates")+xlab("% whole life")+theme(panel. background = element_rect(fill = "#999999")) 

ggplot(data,aes(x=tm,y=log(lapse/exposure)))+ 
geom_point(shape=16,col="#D55E00",size=4)+ 
geom_smooth(method=loess,se=F,col="#009E73",lwd=1 .5)+ 

geom_hline(aes(yintercept=log(sum(lapse)/sum(exposure))),col="#009E73",lwd=1.5,lty=2)+ 
ylab("log lapse rates")+xlab("% term")+theme(panel. background = element_rect(fill = "#999999")) 
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ggplot(data,aes(x=ot,y=log(lapse/exposure)))+ 
geom_point(shape=16,col="#D55E00",size=4)+ 
geom_smooth(method=loess,se=F,col="#009E73",lwd=1 .5)+ 

geom_hline(aes(yintercept=log(sum(lapse)/sum(exposure))),col="#009E73",lwd=1.5,lty=2)+ 
ylabflog lapse rates")+xlab("% others")+theme(panel. background = element_rect(fill = "#999999")) 

ggplot(data,aes(x=sp,y=log(lapse/exposure)))+ 
geom_point(shape=16,col="#D55E00",size=4)+ 
geom_smooth(method=loess,se=F,col="#009E73",lwd=1 .5)+ 

geom_hline(aes(yintercept=log(sum(lapse)/sum(exposure))),col="#009E73",lwd=1 .5,lty=2)+ 

ylabflog lapse rates")+xlab("% single premium")+theme(panel. background = element_rect(fill = "#999999")) 

#log lapse rate and policy duration - dO, dl , d2 
ggplot(data,aes(x=dO,y=log(lapse/exposure)))+ 
geom_point(shape=16,col="#D55E00",size=4)+ 
geom_smooth(method=loess,se=F,col="#009E73",lwd=1 .5)+ 

geom_hline(aes(yintercept=log(sum(lapse)/sum(exposure))),col="#009E73",lwd=1.5,lty=2)+ 

ylabflog lapse rates")+xlab("% first policy year")+theme(panel. background = element_rect(fill = "#999999")) 

ggplot(data,aes(x=d1,y=log(lapse/exposure)))+ 
geom_point(shape=16,col="#D55E00",size=4)+ 
geom_smooth(method=loess,se=F,col="#009E73",lwd=1 .5)+ 

geom_hline(aes(yintercept=log(sum(lapse)/sum(exposure))),col="#009E73",lwd=1.5,lty=2)+ 

ylabflog lapse rates")+xlab("% second policy year")+theme(panel. background = element_rect(fill = "#999999")) 

ggplot(data,aes(x=d2,y=log(lapse/exposure)))+ 
geom_point(shape=16,col="#D55E00",size=4)+ 
geom_smooth(method=loess,se=F,col="#009E73",lwd=1 .5)+ 

geom_hline(aes(yintercept=log(sum(lapse)/sum(exposure))),col="#009E73",lwd=1.5,lty=2)+ 

ylabflog lapse rates")+xlab("% third policy year")+theme(panel. background = element_rect(fill = "#999999")) 
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#### check for multicollinearity 

colcheck = data.frame(wl,tm,ot,sp,d0,d1 ,d2) 

par(mfrow=c(1,1)) 

corrplot(cor(colcheck),method="ellipse",col="#009E73",addCoef.col="white",cl.pos="n",addgrid.col="#D55E00",tl.col="black",diag=F,t 

l.pos="d",font=2) 



#actual check 

colchecky = data.frame(wl,tm,ot,sp,d0,d1 ,d2,year) 
ccwl = lm(wK,data=colchecky[c(-1 )]);rsqwl=summary(ccwl)$r.squared 
cctm = lm(tm~.,data=colchecky[c(-2)]);rsqtm=summary(cctm)$r.squared 
ccot = lm(ot~.,data=colchecky[c(-3)]);rsqot=summary(ccot)$r.squared 
ccsp = lm(sp~.,data=colchecky[c(-4)]);rsqsp=summary(ccsp)$r.squared 
ccdO = Im(d0~.,data=colchecky[c(-5)]);rsqd0=summary(ccd0)$r.squared 
ccdl = Im(d1 ~.,data=colchecky[c(-6)]);rsqd1 =summary(ccd1 )$r.squared 
ccd2 = lm(d2~.,data=colchecky[c(-7)]);rsqd2=summary(ccd2)$r.squared 

#extra check to explain why there would be multicollinearity 

ccfwl = lm(wK,data=data[c(-3,-4,-5)]);rsqfwl=summary(ccfwl)$r.squared 

ccftm = lm(tm~.,data=data[c(-3,-4,-6)]);rsqftm=summary(ccftm)$r.squared 

ccfot = lm(ot~.,data=data[c(-3,-4,-7)]);rsqfot=summary(ccfot)$r.squared 

ccfsp = lm(sp~.,data=data[c(-3,-4,-8)]);rsqfsp=summary(ccfsp)$r.squared 

ccfdO = lm(d0~.,data=data[c(-3,-4,-9)]);rsqfd0=summary(ccfd0)$r.squared 

ccfdl = Im(d1 ~.,data=data[c(-3,-4,-1 0)]);rsqfd1 =summary(ccfd1 )$r.squared 

ccfd2 = lm(d2~.,data=data[c(-3,-4,-1 1)]);rsqfd2=summary(ccfd2)$r.squared 

rsq=c(rsqwl,rsqtm,rsqot,rsqsp,rsqd0,rsqd1,rsqd2) 

rsqf = c(rsqfwl,rsqftm,rsqfot,rsqfsp,rsqfd0,rsqfd1,rsqfd2) 

viftable = data.frame(rsq,rsqf,vif=1/(1-rsq),viff=1/(1-rsqf)) 

#### poisson regression 

#null model 
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modpn = 

glm(lapse~offset(log(exposure)),family=poisson(link=log));summary(modpn);capture.output(summary(modpn),file="modpn.doc") 

#one factor model 

modpl = update(modpn,.~.+company);summary(modp1 );capture.output(summary(modp1 ),file="modp1 .doc") 
modp2 = update(modpn,.~.+wl);summary(modp2);capture.output(summary(modp2),file="modp2.doc") 
modp3 = update(modpn,.~.+tm);summary(modp3);capture.output(summary(modp3),file="modp3.doc") 
modp4 = update(modpn,.~.+ot);summary(modp4);capture.output(summary(modp4),file="modp4.doc") 
modp5 = update(modpn,.~.+sp);summary(modp5);capture.output(summary(modp5),file="modp5.doc") 
modp6 = update(modpn,.~.+d0);summary(modp6);capture.output(summary(modp6),file="modp6.doc") 
modp7 = update(modpn,.~.+d1);summary(modp7);capture.output(summary(modp7),file="modp7.doc") 
modp8 = update(modpn,.~.+d2);summary(modp8);capture.output(summary(modp8),file="modp8.doc") 
modp9 = update(modpn,.~.+year);summary(modp9);capture.output(summary(modp9),file="modp9.doc") 

#saturated model 

modps = update(modpn,.~.+wl+tm+ot+sp+d0+d1+d2+year);summary(modps);capture.output(summary(modps),file="modps.doc") 

#perform chi-square test 

anova(modpn, modpl ,test="Chi") 

anova(modpn,modp2,test="Chi") 

anova(modpn,modp3,test="Chi") 

anova(modpn,modp4,test="Chi") 

anova(modpn,modp5,test="Chi") 

anova(modpn,modp6,test="Chi") 

anova(modpn,modp7,test="Chi") 

anova(modpn,modp8,test="Chi") 

anova(modpn,modp9,test="Chi") 

anova(modpn, modps, test="Chi") 

#### quasipoisson regression 
#null model 
modqpn = 

glm(lapse~offset(log(exposure)),family=quasipoisson(link=log));summary(modqpn);capture.output(summary(modqpn),file="modqpn.d 

oc") 
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#one factor model 

modqpl = update(modqpn,.~.+company);summary(modqp1 );capture.output(summary(modqp1 ),file="modqp1 .doc") 
modqp2 = update(modqpn,.~.+wl);summary(modqp2);capture.output(summary(modqp2),file="modqp2.doc") 
modqp3 = update(modqpn,.~.+tm);summary(modqp3);capture.output(summary(modqp3),file="modqp3.doc") 
modqp4 = update(modqpn,.~.+ot);summary(modqp4);capture.output(summary(modqp4),file="modqp4.doc") 
modqp5 = update(modqpn,.~.+sp);summary(modqp5);capture.output(summary(modqp5),file="modqp5.doc") 
modqp6 = update(modqpn,.~.+d0);summary(modqp6);capture.output(summary(modqp6),file="modqp6.doc") 
modqp7 = update(modqpn,.~.+d1);summary(modqp7);capture.output(summary(modqp7),file="modqp7.doc") 
modqp8 = update(modqpn,.~.+d2);summary(modqp8);capture.output(summary(modqp8),file="modqp8.doc") 
modqp9 = update(modqpn,.~.+year);summary(modqp9);capture.output(summary(modqp9),file="modqp9.doc") 

#saturated model 
modqps = 

update(modqpn,.~.+wl+tm+ot+sp+d0+d1+d2+year);summary(modqps);capture.output(summary(modqps),file="modqps.doc") 
#perform F test 

anova(modqpn, modqpl ,test="F") 
anova(modqpn,modqp2,test="F") 
anova(modqpn,modqp3,test="F") 
anova(modqpn,modqp4,test="F") 
anova(modqpn,modqp5,test="F") 
anova(modqpn,modqp6,test="F") 
anova(modqpn,modqp7,test="F") 
anova(modqpn,modqp8,test="F") 
anova(modqpn,modqp9,test="F") 
anova(modqpn, modqps, test="F") 

#model selection based on F test 
dropl (modqps, test="F") 

modqpsil = update(modqps,.~.-d1 );summary(modqpsi1 );capture.output(summary(modqpsi1 ),file="modqpsi1 .doc") 
dropl (modqpsil ,test="F") 

modqpsi2 = update(modqpsi1 ,.~.-year);summary(modqpsi2);capture.output(summary(modqpsi2),file="modqpi2s.doc") 
dropl (modqpsi2,test="F") 
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modqpsi3 = update(modqpsi2,.~.-sp);summary(modqpsi3);capture.output(summary(modqpsi3),file="modqpsi3.doc") 
dropl (modqpsi3,test="F") 

modqpsi4 = update(modqpsi3,.~.-d2);summary(modqpsi4);capture.output(summary(modqpsi4),file="modqpsi4.doc") 
dropl (modqpsi4,test="F") 

modqpf = modqpsi4;summary(modqpf);capture.output(summary(modqpf),file="modqpf.doc") 
anova(modqpn,modqpf,test="F") 

#multicollinearity adjustment 

modqpfadjl = update(modqpf,.~.-d0);summary(modqpfadj1);capture.output(summary(modqpfadj1),file="modqpfadj1 .doc") 
anova(modqpn,modqpfadj1 ,test="F") 

modqpfadj2 = update(modqpf,.~.-ot);summary(modqpfadj2);capture.output(summary(modqpfadj2),file="modqpfadj2.doc") 
anova(modqpn,modqpfadj2,test="F") 

modqpfadj3 = update(modqpf,.~.-d0-ot);summary(modqpfadj3);capture.output(summary(modqpfadj3),file="modqpfadj3.doc") 
anova(modqpn,modqpfadj3,test="F") 

modqpp = rnodqpfadj2;capture.output(summary(modqpp),file="modqpp.doc") 

#standardised deviance plot 

ggplot(data,aes(x=predict(modqpp),y=rstandard(modqpp,type="deviance")))+ 
geom_point(shape=16,size=4,col="#D55E00")+ 
geom_smooth(method=loess,col="#009E73",lwd=1.5,se=F)+ 
geom_hline(aes(yintercept=0),col="#009E73",lwd=1 .5,lty=2)+ 
geom_hline(aes(yintercept=3),col="#009E73",lwd=1 .5,lty=2)+ 
geom_hline(aes(yintercept=-3),col="#009E73",lwd=1 .5,lty=2)+ 

theme(legend.position="none")+ylab("studentised deviance residuals")+xlab("fitted values")+theme(panel. background = 
element_rect(fill = "#999999")) 

#normal qqplot 

ggplot(data.frame(x=qnorm(ppoints(rstandard(modqpp))),y=sort(rstandard(modqpp))),aes(x=x,y=y))+ 

geom_point(shape=16,size=4,col="#D55E00")+ 

geom_line(aes(x=x,y=x),col="#009E73",lwd=1)+ 

xlab("theoretical quantiles")+ylab("sample quantiles")+theme(panel. background = element_rect(fill = "#999999")) 
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#Hat values plot 

ggplot(data=data.frame(hatvalues(modqpp)),aes(y=hatvalues(modqpp),x=1 :75))+ 
geom_bar(stat="identity",col="#D55E00",fill="#D55E00")+ 

geom_hline(yintercept=2 * (75-df.residual(modqpp))/75,color="#009E73",lwd=1 .5,se=F,lty=2)+ 
ylab("hat diagonals")+xlab("id")+scale_x_continuous(breaks=seq(0,75,15))+ 
theme(legend.position="none")+theme(panel. background = element_rect(fill = "#999999")) 

#Cook's distance 

ggplot(data=data.frame(cooks.distance(modqpp)),aes(y=cooks.distance(modqpp),x=1 :75))+ 
geom_bar(stat="identity",col="#D55E00",fill="#D55E00")+ 
geom_hline(yintercept=4/75,color="#009E73",lwd=1.5,se=F,lty=2)+ 
ylab("Cook's distance")+xlab("id")+scale_x_continuous(breaks=seq(0,75,15))+ 

theme(legend.position="none")+coord_cartesian(ylim=c(0,2))+theme(panel. background = element_rect(fill = "#999999")) 
#COVRATIO 

ggplot(data=data.frame(covratio(modqpp)),aes(y=covratio(modqpp),x=1 :75))+ 
geom_bar(stat="identity",col="#D55E00",fill="#D55E00")+ 

geom_hline(yintercept=1 + 3*(75-df.residual(modqpp))/75,color="#009E73",lwd=1 .5,se=F,lty=2)+ 
geom_hline(yintercept=1 - 3*(75-df.residual(modqpp))/75,color="#009E73",lwd=1 .5,se=F,lty=2)+ 
ylab("covratio")+xlab("id")+scale_x_continuous(breaks=seq(0,75,15))+ 
theme(legend.position="none")+theme(panel. background = element_rect(fill = "#999999")) 

#DFFITS 

ggplot(data=data.frame(dffits(modqpp)),aes(y=dffits(modqpp),x=1 :75))+ 
geom_bar(stat="identity",col="#D55E00",fill="#D55E00")+ 

geom_hline(yintercept=2 * sqrt(75-df.residual(modqpp)/75),color="#009E73",lwd=1 .5,se=F,lty=2)+ 
geom_hline(yintercept=-2 * sqrt(75-df.residual(modqpp)/75),color="#009E73",lwd=1 .5,se=F,lty=2)+ 
ylab("dffits")+xlab("id")+scale_x_continuous(breaks=seq(0,75,15))+ 
theme(legend.position="none")+theme(panel. background = element_rect(fill = "#999999")) 

#DFBETA 

ggplot(data=data.frame(dfbeta(modqpp)[,1]),aes(y=dfbeta(modqpp)[,1],x=1 :75))+ 
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geom_bar(stat="identity",col="#D55E00",fill="#D55E00")+ 
geom_hline(yintercept=2/sqrt(75),color="#009E73",lwd=1 .5,se=F,lty=2)+ 
geom_hline(yintercept=-2/sqrt(75),color="#009E73",lwd=1.5,se=F,lty=2)+ 
ylabfdfbeta intercept")+xlab("id")+scale_x_continuous(breaks=seq(0,75,15))+ 
theme(legend.position="none")+theme(panel. background = element_rect(fill = "#999999")) 

ggplot(data=data.frame(dfbeta(modqpp)[,2]),aes(y=dfbeta(modqpp)[,2],x=1 :75))+ 
geom_bar(stat="identity",col="#D55E00",fill="#D55E00")+ 
geom_hline(yintercept=2/sqrt(75),color="#009E73",lwd=1 .5,se=F,lty=2)+ 
geom_hline(yintercept=-2/sqrt(75),color="#009E73",lwd=1.5,se=F,lty=2)+ 
ylabfdfbeta wl")+xlab("id")+scale_x_continuous(breaks=seq(0,75,15))+ 
theme(legend.position="none")+theme(panel. background = element_rect(fill = "#999999")) 

ggplot(data=data.frame(dfbeta(modqpp)[,3]),aes(y=dfbeta(modqpp)[,3],x=1 :75))+ 
geom_bar(stat="identity",col="#D55E00",fill="#D55E00")+ 
geom_hline(yintercept=2/sqrt(75),color="#009E73",lwd=1 .5,se=F,lty=2)+ 
geom_hline(yintercept=-2/sqrt(75),color="#009E73",lwd=1.5,se=F,lty=2)+ 
ylabfdfbeta tm")+xlab("id")+scale_x_continuous(breaks=seq(0,75,15))+ 
theme(legend.position="none")+theme(panel. background = element_rect(fill = "#999999")) 

ggplot(data=data.frame(dfbeta(modqpp)[,4]),aes(y=dfbeta(modqpp)[,4],x=1 :75))+ 
geom_bar(stat="identity",col="#D55E00",fill="#D55E00")+ 
geom_hline(yintercept=2/sqrt(75),color="#009E73",lwd=1 .5,se=F,lty=2)+ 
geom_hline(yintercept=-2/sqrt(75),color="#009E73",lwd=1.5,se=F,lty=2)+ 
ylabfdfbeta d0")+xlab("id")+scale_x_continuous(breaks=seq(0,75,15))+ 
theme(legend.position="none")+theme(panel. background = element_rect(fill = "#999999")) 

#### model lift plot 

modqppdat = data.frame(data[,c("lapse", "exposure")], fv=modqpp$fitted.values/exposure) 

modqppbuc = cut(modqpp$fitted.values/exposure,c(-lnf,quantile(modqpp$fitted.values/exposure,probs 

seq(0. 1,0. 9, 0.1)),+lnf), include, lowest = T) 

modqppdec = aggregate(cbind(lapse, exposure) -modqppbuc, data=data[,c("lapse", "exposure")], sum) 
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modqpprellr = (modqppdec$lapse / modqppdec$exposure) / (sum(modqppdec$lapse) / sum(modqppdec$exposure)) 
ggplot(data=data.frame(modqppdec),aes(x=1 :1 0,y=modqpprellr))+ 
geom_bar(stat="identity",col="#D55E00",fill="#D55E00")+ 
geom_hline(yintercept=1 ,col="#009E73",lwd=1 .5,se=F,lty=2)+ 
ylab("relative lapse rate")+xlab("decile")+ scale_x_continuous(breaks=seq(1 ,10,1))+ 
theme(legend.position="none")+theme(panel. background = element_rect(fill = "#999999")) 

#### negative binomial regression 
#null model 

modnbn = glm.nb(lapse~offset(log(exposure)),link=log);summary(modnbn);capture.output(summary(modnbn),file="modnbn.doc") 
#one factor model 

modnbl = update(modnbn,.~.+company);summary(modnb1 );capture.output(summary(modnb1 ),file="modnb1 .doc") 
modnb2 = update(modnbn,.~.+wl);summary(modnb2);capture.output(summary(modnb2),file="modnb2.doc") 
modnb3 = update(modnbn,.~.+tm);summary(modnb3);capture.output(summary(modnb3),file="modnb3.doc") 
modnb4 = update(modnbn,.~.+ot);summary(modnb4);capture.output(summary(modnb4),file="modnb4.doc") 
modnb5 = update(modnbn,.~.+sp);summary(modnb5);capture.output(summary(modnb5),file="modnb5.doc") 
modnb6 = update(modnbn,.~.+d0);summary(modnb6);capture.output(summary(modnb6),file="modnb6.doc") 
modnb7 = update(modnbn,.~.+d1);summary(modnb7);capture.output(summary(modnb7),file="modnb7.doc") 
modnb8 = update(modnbn,.~.+d2);summary(modnb8);capture.output(summary(modnb8),file="modnb8.doc") 
modnb9 = update(modnbn,.~.+year);summary(modnb9);capture.output(summary(modnb9),file="modnb9.doc") 

#saturated model 
modnbs = 

update(modnbn,.~.+wl+tm+ot+sp+d0+d1+d2+year);summary(modnbs);capture.output(summary(modnbs),file="modnbs.doc") 

#perform chi-square test 
anova(modnbn, modnbl ) 
anova(modnbn,modnb2) 
anova(modnbn,modnb3) 
anova(modnbn,modnb4) 
anova(modnbn,modnb5) 
anova(modnbn,modnb6) 
anova(modnbn,modnb7) 
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anova(modnbn,modnb8) 

anova(modnbn,modnb9) 

anova(modnbn,modnbs) 

#final model 
stepAIC(modnbs) 

modnbf = glm.nb(lapse~tm+ot+dO+offset(log(exposure)),link=log); 

summary(modnbf);capture.output(summary(modnbf),file="modnbf.doc") 

anova(modnbf,modnbn) 

#multicollinearity adjustment 

modnbfadjl = update(modnbf,.~.-d0);summary(modnbfadj1);capture.output(summary(modnbfadj1),file="modnbfadj1 .doc") 
modnbfadj2 = update(modnbf,.~.-ot);summary(modnbfadj2);capture.output(summary(modnbfadj2),file="modnbfadj2.doc") 
modnbp = modnbfadj2 
anova(modnbn, modnbfadjl ) 
anova(modnbn,modnbfadj2) 

#standardised deviance plot 

ggplot(data,aes(x=predict(modnbp),y=rstandard(modnbp,type="deviance")))+ 
geom_point(shape=16,size=4,col="#D55E00")+ 
geom_smooth(method=loess,col="#009E73",lwd=1.5,se=F)+ 
geom_hline(aes(yintercept=0),col="#009E73",lwd=1 .5,lty=2)+ 
geom_hline(aes(yintercept=3),col="#009E73",lwd=1 .5,lty=2)+ 
geom_hline(aes(yintercept=-3),col="#009E73",lwd=1 .5,lty=2)+ 

theme(legend.position="none")+ylab("studentised deviance residuals")+xlab("fitted values")+theme(panel. background = 
element_rect(fill = "#999999")) 

#normal qqplot 

ggplot(data.frame(x=qnorm(ppoints(rstandard(modnbp))),y=sort(rstandard(modnbp))),aes(x=x,y=y))+ 
geom_point(shape=16,size=4,col="#D55E00")+ 
geom_line(aes(x=x,y=x),col="#009E73",lwd=1 .5)+ 

xlab("theoretical quantiles")+ylab("sample quantiles")+theme(panel. background = element_rect(fill = "#999999")) 
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#Hat values plot 

ggplot(data=data.frame(hatvalues(modnbp)),aes(y=hatvalues(modnbp),x=1 :75))+ 
geom_bar(stat="identity",col="#D55E00",fill="#D55E00")+ 

geom_hline(yintercept=2 * (75-df.residual(modnbp))/75,col="#009E73",lwd=1 .5,se=F,lty=2)+ 
ylab("hat diagonals")+xlab("id")+scale_x_continuous(breaks=seq(0,75,15))+ 
theme(legend.position="none")+theme(panel. background = element_rect(fill = "#999999")) 

#Cook's distance 

ggplot(data=data.frame(cooks.distance(modnbp)),aes(y=cooks.distance(modnbp),x=1 :75))+ 
geom_bar(stat="identity",col="#D55E00",fill="#D55E00")+ 
geom_hline(yintercept=4/75,col="#009E73",lwd=1.5,se=F,lty=2)+ 
ylab("Cook's distance")+xlab("id")+scale_x_continuous(breaks=seq(0,75,15))+ 

theme(legend.position="none")+coord_cartesian(ylim=c(0,2))+theme(panel. background = element_rect(fill = "#999999")) 
#COVRATIO 

ggplot(data=data.frame(covratio(modnbp)),aes(y=covratio(modnbp),x=1 :75))+ 
geom_bar(stat="identity",col="#D55E00",fill="#D55E00")+ 

geom_hline(yintercept=1 + 3*(75-df.residual(modnbp))/75,col="#009E73",lwd=1 .5,se=F,lty=2)+ 
geom_hline(yintercept=1 - 3*(75-df.residual(modnbp))/75,col="#009E73",lwd=1 .5,se=F,lty=2)+ 
ylab("covratio")+xlab("id")+scale_x_continuous(breaks=seq(0,75,15))+ 
theme(legend.position="none")+theme(panel. background = element_rect(fill = "#999999")) 

#DFFITS 

ggplot(data=data.frame(dffits(modnbp)),aes(y=dffits(modnbp),x=1 :75))+ 
geom_bar(stat="identity",col="#D55E00",fill="#D55E00")+ 

geom_hline(yintercept=2 * sqrt(75-df.residual(modnbp)/75),col="#009E73",lwd=1 .5,se=F,lty=2)+ 
geom_hline(yintercept=-2 * sqrt(75-df.residual(modnbp)/75),col="#009E73",lwd=1 .5,se=F,lty=2)+ 
ylab("dffits")+xlab("id")+scale_x_continuous(breaks=seq(0,75,15))+ 
theme(legend.position="none")+theme(panel. background = element_rect(fill = "#999999")) 

#DFBETA 

ggplot(data=data.frame(dfbeta(modnbp)[,1]),aes(y=dfbeta(modnbp)[,1],x=1 :75))+ 



79 




geom_bar(stat="identity",col="#D55E00",fill="#D55E00")+ 
geom_hline(yintercept=2/sqrt(75),col="#009E73",lwd=1.5,se=F,lty=2)+ 
geom_hline(yintercept=-2/sqrt(75),col="#009E73",lwd=1 .5,se=F,lty=2)+ 
ylabfdfbeta intercept")+xlab("id")+scale_x_continuous(breaks=seq(0,75,15))+ 
theme(legend.position="none")+theme(panel. background = element_rect(fill = "#999999")) 

ggplot(data=data.frame(dfbeta(modnbp)[,2]),aes(y=dfbeta(modnbp)[,2],x=1 :75))+ 
geom_bar(stat="identity",col="#D55E00",fill="#D55E00")+ 
geom_hline(yintercept=2/sqrt(75),col="#009E73",lwd=1.5,se=F,lty=2)+ 
geom_hline(yintercept=-2/sqrt(75),col="#009E73",lwd=1 .5,se=F,lty=2)+ 
ylabfdfbeta tm")+xlab("id")+scale_x_continuous(breaks=seq(0,75,15))+ 
theme(legend.position="none")+theme(panel. background = element_rect(fill = "#999999")) 

ggplot(data=data.frame(dfbeta(modnbp)[,3]),aes(y=dfbeta(modnbp)[,3],x=1 :75))+ 
geom_bar(stat="identity",col="#D55E00",fill="#D55E00")+ 
geom_hline(yintercept=2/sqrt(75),col="#009E73",lwd=1.5,se=F,lty=2)+ 
geom_hline(yintercept=-2/sqrt(75),col="#009E73",lwd=1 .5,se=F,lty=2)+ 
ylabfdfbeta d0")+xlab("id")+scale_x_continuous(breaks=seq(0,75,15))+ 
theme(legend.position="none")+theme(panel. background = element_rect(fill = "#999999")) 

#### model lift plot 

modnbpdat = data.frame(data[,c("lapse", "exposure")], fv=modnbp$fitted.values/exposure) 
modnbpbuc = cut(modnbp$fitted.values/exposure,c(-lnf,quantile(modnbp$fitted.values/exposure,probs = 
seq(0. 1,0. 9, 0.1)),+lnf), include, lowest = T) 

modnbpdec = aggregate(cbind(lapse, exposure) -modnbpbuc, data=data[,c("lapse", "exposure")], sum) 
modnbprellr = (modnbpdec$lapse / modnbpdec$exposure) / (sum(modnbpdec$lapse) / sum(modnbpdec$exposure)) 
ggplot(data=data.frame(modnbpdec),aes(x=1 :1 0,y=modnbprellr))+ 
geom_bar(stat="identity",col="#D55E00",fill="#D55E00")+ 
geom_hline(yintercept=1 ,col="#009E73",lwd=1 .5,se=F,lty=2)+ 
ylabfrelative lapse rate")+xlab("decile")+ scale_x_continuous(breaks=seq(1 ,10,1))+ 
theme(legend.position="none")+theme(panel. background = element_rect(fill = "#999999")) 



80 




#### logistic/binomial regression 
#null model 

modbn = glm(cbind(lapse, exposure- 

lapse)- 1 ,family=binomial(link=logit));summary(modbn);capture.output(summary(modbn),file="modbn.doc") 

#one factor model 

modbl = update(modbn,.-.+company);summary(modb1 );capture.output(summary(modb1 ),file="modb1 .doc") 
modb2 = update(modbn,.-.+wl);summary(modb2);capture.output(summary(modb2),file="modb2.doc") 
modb3 = update(modbn,.-.+tm);summary(modb3);capture.output(summary(modb3),file="modb3.doc") 
modb4 = update(modbn,.-.+ot);summary(modb4);capture.output(summary(modb4),file="modb4.doc") 
modb5 = update(modbn,.-.+sp);summary(modb5);capture.output(summary(modb5),file="modb5.doc") 
modb6 = update(modbn,.-.+d0);summary(modb6);capture.output(summary(modb6),file="modb6.doc") 
modb7 = update(modbn,.-.+d1 );summary(modb7);capture.output(summary(modb7),file="modb7.doc") 
modb8 = update(modbn,.-.+d2);summary(modb8);capture.output(summary(modb8),file="modbn8.doc") 
modb9 = update(modbn,.-.+year);summary(modb9);capture.output(summary(modb9),file="modb9.doc") 

#saturated model 

modbs = update(modbn,.-.+wl+tm+ot+sp+d0+d1+d2+year);summary(modbs);capture.output(summary(modbs),file="modbs.doc") 

#perform chi-square test 

anova(modbn, modbl ) 

anova(modbn,modb2) 

anova(modbn,modb3) 

anova(modbn,modb4) 

anova(modbn,modb5) 

anova(modbn,modb6) 

anova(modbn,modb7) 

anova(modbn,modb8) 

anova(modbn,modb9) 

anova(modbn, modbs) 

#### quasibinomial model 
#null model 
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modqbn = glm(cbind(lapse, exposure- 

lapse)- 1 ,family=quasibinomial(link=logit));summary(modqbn);capture.output(summary(modqbn),file="modqbn.doc") 

#one factor model 

modqbl = update(modqbn,.~.+company);summary(modqb1 );capture.output(summary(modqb1 ),file="modqb1 .doc") 
modqb2 = update(modqbn,.~.+wl);summary(modqb2);capture.output(summary(modqb2),file="modqb2.doc") 
modqb3 = update(modqbn,.~.+tm);summary(modqb3);capture.output(summary(modqb3),file="modqb3.doc") 
modqb4 = update(modqbn,.~.+ot);summary(modqb4);capture.output(summary(modqb4),file="modqb4.doc") 
modqb5 = update(modqbn,.~.+sp);summary(modqb5);capture.output(summary(modqb5),file="modqb5.doc") 
modqb6 = update(modqbn,.~.+d0);summary(modqb6);capture.output(summary(modqb6),file="modqb6.doc") 
modqb7 = update(modqbn,.~.+d1);summary(modqb7);capture.output(summary(modqb7),file="modqb7.doc") 
modqb8 = update(modqbn,.~.+d2);summary(modqb8);capture.output(summary(modqb8),file="modqb8.doc") 
modqb9 = update(modqbn,.~.+year);summary(modqb9);capture.output(summary(modqb9),file="modqb9.doc") 

#saturated model 
modqbs = 

update(modqbn,.~.+wl+tm+ot+sp+d0+d1+d2+year);summary(modqbs);capture.output(summary(modqbs),file="modqbs.doc") 

anova(modqbn, modqbl ,test="F") 
anova(modqbn,modqb2,test="F") 
anova(modqbn,modqb3,test="F") 
anova(modqbn,modqb4,test="F") 
anova(modqbn,modqb5,test="F") 
anova(modqbn,modqb6,test="F") 
anova(modqbn,modqb7,test="F") 
anova(modqbn,modqb8,test="F") 
anova(modqbn,modqb9,test="F") 
anova(modqbn, modqbs, test="F") 

#model selection based on F test 
dropl (modqbs, test="F") 

modqbsil = update(modqbs,.~.-d1 );summary(modqbsi1 );capture.output(summary(modqbsi1 ),file="modqbsi1 .doc") 
dropl (modqbsil ,test="F") 

modqbsi2 = update(modqbsi1 ,.~.-year);summary(modqbsi2);capture.output(summary(modqbsi2),file="modqbsi2.doc") 
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dropl (modqbsi2,test="F") 

modqbsi3 = update(modqbsi2,.~.-sp);summary(modqbsi3);capture.output(summary(modqbsi3),file="modqbsi3.doc") 
dropl (modqbsi3,test="F") 

modqbsi4 = update(modqbsi3,.~.-d2);summary(modqbsi4);capture.output(summary(modqbsi4),file="modqbsi4.doc") 
dropl (modqbsi4,test="F") 

modqbf = modqbsi4;summary(modqbf);capture.output(summary(modqbf),file="modqbf.doc") 
anova(modqbn,modqbf,test="F") 

#multicollinearity adjustment 

modqbfadjl = update(modqbf,.~.-d0);summary(modqbfadj1);capture.output(summary(modqbfadj1),file="modqbfadj1 .doc") 

modqbfadj2 = update(modqbf,.~.-ot);summary(modqbfadj2);capture.output(summary(modqbfadj2),file="modqbfadj2.doc") 

modqbfadj3 = update(modqbf,.~.-d0-ot);summary(modqbfadj3);capture.output(summary(modqbfadj3),file="modqbfadj3.doc") 

modqbp = modqbfadj2;capture.output(summary(modqbp),file="modqbp.doc") 

anova(modqbn, modqbfadjl ,test="F") 

anova(modqbn,modqbfadj2,test="F") 

an ova (m odq bn , m odq bf adj3 ,test=" F") 

#standardised deviance plot 

ggplot(data,aes(x=fitted.values(modqbp),y=rstandard(modqbp,type="deviance")))+ 
geom_point(shape=16,size=4,col="#D55E00")+ 
geom_smooth(method=loess,col="#009E73",lwd=1.5,se=F)+ 
geom_hline(aes(yintercept=0),col="#009E73",lwd=1 .5,lty=2)+ 
geom_hline(aes(yintercept=3),col="#009E73",lwd=1 .5,lty=2)+ 
geom_hline(aes(yintercept=-3),col="#009E73",lwd=1 .5,lty=2)+ 

theme(legend.position="none")+ylab("studentised deviance residuals")+ x| ab("fitted values")+theme(panel. background = 
element_rect(fill = "#999999")) 

#normal qqplot 

ggplot(data.frame(x=qnorm(ppoints(rstandard(modqbp))),y=sort(rstandard(modqbp))),aes(x=x,y=y))+ 
geom_point(shape=16,size=4,col="#D55E00")+ 
geom_line(aes(x=x,y=x),col="#009E73",lwd=1 .5)+ 

xlab("theoretical quantiles")+ylab("sample quantiles")+theme(panel. background = element_rect(fill = "#999999")) 
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#Hat values plot 

ggplot(data=data.frame(hatvalues(modqbp)),aes(y=hatvalues(modqbp),x=1 :75))+ 
geom_bar(stat="identity",col="#D55E00",fill="#D55E00")+ 

geom_hline(yintercept=2 * (75-df.residual(modqbp))/75,col="#009E73",lwd=1 .5,se=F,lty=2)+ 
ylabfhat diagonals")+xlab("id")+scale_x_continuous(breaks=seq(0,75,15))+ 
theme(legend.position="none")+theme(panel. background = element_rect(fill = "#999999")) 

#Cook's distance 

ggplot(data=data.frame(cooks.distance(modqbp)),aes(y=cooks.distance(modqbp),x=1 :75))+ 
geom_bar(stat="identity",col="#D55E00",fill="#D55E00")+ 
geom_hline(yintercept=4/75,col="#009E73",lwd=1.5,se=F,lty=2)+ 
ylab("Cook's distance")+xlab("id")+scale_x_continuous(breaks=seq(0,75,15))+ 

theme(legend.position="none")+coord_cartesian(ylim=c(0,2))+theme(panel. background = element_rect(fill = "#999999")) 
#COVRATIO 

ggplot(data=data.frame(covratio(modqbp)),aes(y=covratio(modqbp),x=1 :75))+ 
geom_bar(stat="identity",col="#D55E00",fill="#D55E00")+ 

geom_hline(yintercept=1 + 3*(75-df.residual(modqbp))/75,col="#009E73",lwd=1 .5,se=F,lty=2)+ 
geom_hline(yintercept=1 - 3*(75-df.residual(modqbp))/75,col="#009E73",lwd=1 .5,se=F,lty=2)+ 
ylab("covratio")+xlab("id")+scale_x_continuous(breaks=seq(0,75,15))+ 
theme(legend.position="none")+theme(panel. background = element_rect(fill = "#999999")) 

#DFFITS 

ggplot(data=data.frame(dffits(modqbp)),aes(y=dffits(modqbp),x=1 :75))+ 
geom_bar(stat="identity",col="#D55E00",fill="#D55E00")+ 

geom_hline(yintercept=2 * sqrt(75-df.residual(modqbp)/75),col="#009E73",lwd=1 .5,se=F,lty=2)+ 
geom_hline(yintercept=-2 * sqrt(75-df.residual(modqbp)/75),col="#009E73",lwd=1 .5,se=F,lty=2)+ 
ylab("dffits")+xlab("id")+scale_x_continuous(breaks=seq(0,75,15))+ 
theme(legend.position="none")+theme(panel. background = element_rect(fill = "#999999")) 

#DFBETA 
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ggplot(data=data.frame(dfbeta(modqbp)[,1]),aes(y=dfbeta(modqbp)[,1],x=1 :75))+ 
geom_bar(stat="identity",col="#D55E00",fill="#D55E00")+ 
geom_hline(yintercept=2/sqrt(75),col="#009E73",lwd=1.5,se=F,lty=2)+ 
geom_hline(yintercept=-2/sqrt(75),col="#009E73",lwd=1 .5,se=F,lty=2)+ 
ylabfdfbeta intercept")+xlab("id")+scale_x_continuous(breaks=seq(0,75,15))+ 
theme(legend.position="none")+theme(panel. background = element_rect(fill = "#999999")) 

ggplot(data=data.frame(dfbeta(modqbp)[,2]),aes(y=dfbeta(modqbp)[,2],x=1 :75))+ 
geom_bar(stat="identity",col="#D55E00",fill="#D55E00")+ 
geom_hline(yintercept=2/sqrt(75),col="#009E73",lwd=1.5,se=F,lty=2)+ 
geom_hline(yintercept=-2/sqrt(75),col="#009E73",lwd=1 .5,se=F,lty=2)+ 
ylabfdfbeta wl")+xlab("id")+scale_x_continuous(breaks=seq(0,75,15))+ 
theme(legend.position="none")+theme(panel. background = element_rect(fill = "#999999")) 

ggplot(data=data.frame(dfbeta(modqbp)[,3]),aes(y=dfbeta(modqbp)[,3],x=1 :75))+ 
geom_bar(stat="identity",col="#D55E00",fill="#D55E00")+ 
geom_hline(yintercept=2/sqrt(75),col="#009E73",lwd=1.5,se=F,lty=2)+ 
geom_hline(yintercept=-2/sqrt(75),col="#009E73",lwd=1 .5,se=F,lty=2)+ 
ylabfdfbeta tm")+xlab("id")+scale_x_continuous(breaks=seq(0,75,15))+ 
theme(legend.position="none")+theme(panel. background = element_rect(fill = "#999999")) 

ggplot(data=data.frame(dfbeta(modqbp)[,4]),aes(y=dfbeta(modqbp)[,4],x=1 :75))+ 
geom_bar(stat="identity",col="#D55E00",fill="#D55E00")+ 
geom_hline(yintercept=2/sqrt(75),col="#009E73",lwd=1.5,se=F,lty=2)+ 
geom_hline(yintercept=-2/sqrt(75),col="#009E73",lwd=1 .5,se=F,lty=2)+ 
ylabfdfbeta d0")+xlab("id")+scale_x_continuous(breaks=seq(0,75,15))+ 
theme(legend.position="none")+theme(panel. background = element_rect(fill = "#999999")) 

#### model lift plot 

modqbpdat = data.frame(data[,c("lapse", "exposure")], fv=modqbp$fitted.values) 

modqbpbuc = cut(modqbp$fitted.values,c(-lnf,quantile(modqbp$fitted.values,probs = seq(0.1 ,0.9,0. 1)),+lnf), include, lowest = T) 

modqbpdec = aggregate(cbind(lapse, exposure) -modqbpbuc, data=data[,c("lapse", "exposure")], sum) 
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modqbprellr = (modqbpdec$lapse / modqbpdec$exposure) / (sum(modqbpdec$lapse) / sum(modqbpdec$exposure)) 
ggplot(data=data.frame(modqbpdec),aes(x=1 :1 0,y=modqbprellr))+ 
geom_bar(stat="identity",col="#D55E00",fill="#D55E00")+ 
geom_hline(yintercept=1 ,col="#009E73",lwd=1 .5,se=F,lty=2)+ 
ylab("relative lapse rate")+ x| ab("decile")+ scale_x_continuous(breaks=seq(1 ,10,1))+ 
theme(legend.position="none")+theme(panel. background = element_rect(fill = "#999999")) 
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